NVIDIA DGX
At-Home AI Stack — split-trust shared compute
Two clustered Nvidia DGX Spark nodes (arm64, Ubuntu 24.04) sharing a 256 GB unified memory pool through tensor parallelism (TP=2) over Ray on a 200 Gb/s direct-attach copper interconnect — but with split ownership. spark-01 is your node (your LiteLLM, your Open WebUI, your Hermes Agent, your n8n, your Tailscale). spark-02 is the client's node (their LiteLLM, their Open WebUI, their n8n, their Tailscale). Both LiteLLM proxies talk to the shared vLLM endpoint at spark-01:8000 over the DAC; neither application stack sees the other. Read the Trust model section before deploying — this architecture has specific properties at the API layer that you should understand explicitly.
Architecture
Two physically separate DGX Spark nodes share a single tensor-parallel vLLM cluster (TP=2 over Ray on a 200 Gb/s DAC link) — but each node runs its own independent application stack owned by a different party. The diagram shows three logical layers: application stacks (top, separate per owner), LiteLLM proxies (middle, one per side, separate keys and logs), and the shared compute pool (bottom, TP=2 across both nodes, served by the vLLM head on spark-01:8000). Tailscale sits as a separate overlay on each node — the DAC link is its own private hardware and does not traverse Tailscale.
Trust model
Read this before deploying. The split-trust architecture has specific properties at the API layer that you should understand explicitly. None of this is a new risk introduced by the cluster — it is just the same trust profile you accept any time you use a hosted inference API, made visible.
API-layer visibility — spark-01 sees all prompts
The vLLM head process runs on spark-01 and serves the OpenAI-compatible API on port 8000. Both LiteLLM proxies — yours and the client's — call this endpoint. That means the owner of spark-01 can, in principle, observe every raw prompt and every model output that crosses the API surface. This is structurally identical to the trust profile of any commercial hosted-inference provider (OpenAI, Anthropic, Together, etc.): the entity running the API server can see traffic at the API layer.
Tensor-layer isolation — spark-02 sees only floats
The Ray worker on spark-02 processes tensor activations, not text. It receives intermediate floating-point tensors over NCCL allreduce on the DAC link and contributes its share of the matrix multiplications. The client's node never sees readable prompts or completions; it only sees the mathematical operations its TP rank is responsible for. NCCL traffic on the DAC carries floats, not strings.
Application-layer isolation — fully separate stacks
Knowledge bases, chat history, RAG pipelines, vector indexes, API keys, request logs, and OAuth tokens are completely separate on each node. Your Open WebUI's database is on spark-01; the client's is on spark-02. Your LiteLLM master key is yours; the client's is the client's. Neither party has access to the other's application stack — there is no cross-mounted volume, no shared Postgres, no shared file system. The only thing that crosses the boundary is the inference call from the client's LiteLLM into spark-01:8000.
Network isolation — separate tailnets, private DAC
Each node joins its owner's Tailscale tailnet independently. ACLs on each tailnet are controlled by that owner. The DAC link (198.51.100.0/30) is private physical hardware between the two nodes — it is not routed through either Tailscale network and is not advertised on either tailnet. Tailscale carries application traffic only (clients reaching their own UIs); compute traffic stays on the DAC.
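One way to sanity-check the addressing described above is to test which range a given address falls in. The sketch below is illustrative only: the helper names ip_to_int and ip_in_cidr are hypothetical, the DAC range is the 198.51.100.0/30 example used in this guide, and 100.64.0.0/10 is the range Tailscale assigns node addresses from.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The function body runs in a subshell, so the IFS change stays local.
ip_to_int() (
  set -f            # no globbing while splitting the address
  IFS=.
  set -- $1         # split into four octets
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Return 0 if address $1 lies inside CIDR block $2 (e.g. 198.51.100.0/30).
ip_in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # network part before the slash
  bits=${2#*/}                 # prefix length after the slash
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)
```

For example, `ip_in_cidr 198.51.100.2 198.51.100.0/30` succeeds for the client's DAC address, while the same check against a mgmt address such as 192.0.2.21 fails.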
When this architecture is appropriate
- Both parties have a working relationship and have agreed to this arrangement.
- The data being sent for inference is not regulated (HIPAA / GDPR / SOC2 / PCI / etc.).
- Both parties accept a trust model equivalent to using any commercial hosted-inference API.
When additional agreements are required
- Either party handles regulated data — HIPAA, GDPR, SOC2, PCI-DSS, attorney-client privileged, or similar — in which case a written data processing agreement (DPA / BAA / equivalent) and audit controls are needed before traffic flows.
- Either party has contractual data handling requirements imposed by their own customers or regulators.
- The relationship is not pre-existing and the trust profile of "any hosted inference API" is not acceptable.
spark-01:8000 has no authentication. Tailscale ACLs and host firewall rules are what prevent the client from bypassing their LiteLLM and hitting the unauthenticated endpoint directly. See Step 06 (Tailscale) for the ACL configuration that enforces this.
Hardware topology
| Node | Owner | Mgmt IP | DAC IP | Services |
|---|---|---|---|---|
| spark-01 | You (private) | 192.0.2.21 | 198.51.100.1 | vLLM head (Ray master, TP rank 0), Your LiteLLM, Your Open WebUI, Your Hermes Agent, Your n8n, Your Tailscale |
| spark-02 | Client (separate ownership) | 192.0.2.22 | 198.51.100.2 | vLLM Ray worker (TP rank 1), Client LiteLLM, Client Open WebUI, Client n8n, Client Tailscale |
Interconnects
- DAC interconnect: enp1s0f0np0, MTU 9216, point-to-point 198.51.100.0/30. Carries NCCL for tensor-parallel collectives, Ray control, and the client LiteLLM's inference calls into spark-01:8000. Not routed through either Tailscale network.
- Mgmt interconnect: 192.0.2.0/24 over RJ45, default routes. Used for SSH and node-bootstrap traffic during setup.
- Tailscale (each owner): each node independently joins its owner's tailnet. Application traffic (browser → Open WebUI, Telegram → Hermes, etc.) traverses Tailscale. The DAC link is never advertised onto either tailnet.
- SSH: passwordless both directions between spark-01 and spark-02 at the mgmt IPs (required for the rsync step in Step 01). After setup, this can be locked down or removed.
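Because the DAC link is where the client's LiteLLM reaches spark-01:8000, a host firewall rule on spark-01 could be scoped to that interface and peer address. The sketch below only prints a candidate ufw rule and does not apply it. The helper name vllm_dac_rule is hypothetical, the interface and peer IP are the example values from the topology table, and it assumes vLLM listens on the host network stack (ports published by Docker are not filtered by ufw). NCCL and Ray use additional ports that are not covered here.

```bash
# Print (do not run) a ufw rule that admits vLLM API traffic only from
# the DAC peer on the DAC interface.
#   $1 = DAC interface, $2 = peer DAC IP, $3 = port (defaults to 8000)
vllm_dac_rule() {
  local iface=$1 peer=$2 port=${3:-8000}
  echo "ufw allow in on $iface from $peer to any port $port proto tcp"
}
```

Running `vllm_dac_rule enp1s0f0np0 198.51.100.2` prints the rule so it can be reviewed before being passed to sudo.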
Architecture principles
- Shared compute, split application. The vLLM cluster is the only shared resource. Application stacks above it (LiteLLM, Open WebUI, Hermes, n8n, Tailscale) are duplicated and independently owned.
- Both LiteLLMs hit the same vLLM endpoint. Your LiteLLM uses http://localhost:8000/v1; the client's LiteLLM uses http://198.51.100.1:8000/v1 over the DAC. Neither proxy goes through the other's stack.
- Separate keys, separate logs, separate data. Each LiteLLM has its own master key; each Open WebUI has its own knowledge bases and chat history. Nothing is shared at the application layer.
- Tailscale is per-owner. Two separate tailnets, two separate ACL policies. Cross-tailnet traffic only happens if both owners explicitly configure it (which by default they do not).
- Single-instance per side. No clustered Open WebUI / clustered n8n on either node. HA modes are documented in the appendix only.
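The endpoint rule in the principles above can be sketched as a small helper that returns the api_base each side's LiteLLM would use. The function name vllm_api_base and the role labels owner and client are hypothetical; the URLs are the ones given in this guide.

```bash
# Return the vLLM base URL for a given side of the split.
#   owner  -> loopback on spark-01
#   client -> spark-01's DAC address, passed as $2
vllm_api_base() {
  case "$1" in
    owner)
      echo "http://localhost:8000/v1"
      ;;
    client)
      if [ -z "$2" ]; then
        echo "client role needs spark-01's DAC IP" >&2
        return 1
      fi
      echo "http://$2:8000/v1"
      ;;
    *)
      echo "unknown role: $1" >&2
      return 1
      ;;
  esac
}
```

A quick reachability probe from either node might then look like `curl -fsS "$(vllm_api_base client 198.51.100.1)/models"`, which lists the served model once the cluster is up.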
Network worksheet
Fill these in once. Every code block on this page that contains a matching placeholder (YOUR_NODE1_MGMT_IP, YOUR_USERNAME, etc.) will be live-substituted with the value you type — and a yellow highlight shows you what was filled in. Values are saved to your browser's localStorage so reloads keep them. Master keys, API keys, and other secrets are deliberately not in this worksheet — fill those into the relevant code blocks manually so they never touch localStorage.
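Before running a downloaded script, it can help to confirm that no worksheet value was left at its placeholder. The following is a minimal sketch, not part of the generated scripts: the helper name unfilled_vars is hypothetical, and it only recognises simple assignments of the form NAME="YOUR_SOMETHING", so placeholders without the YOUR_ prefix (such as MODEL_RECIPE) are not detected.

```bash
# List variables in a script whose value is still a YOUR_* placeholder.
#   $1 = path to the script to inspect
unfilled_vars() {
  grep -E '^[A-Z0-9_]+="YOUR_[A-Z0-9_]+"$' "$1" | cut -d= -f1
}
```

For example, `unfilled_vars setup-spark-01.sh` prints one variable name per line for every value that still needs filling in, and prints nothing when all matching assignments have been substituted.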
spark-01 — your node
spark-02 — client node
Shared / per-host
YOUR_MASTER_KEY, YOUR_CLIENT_MASTER_KEY, and YOUR_BRAVE_API_KEY are intentionally not in this worksheet; fill those into the relevant code blocks by hand, and don't paste them into a browser-stored field. The worksheet only handles network identifiers and your username.
TL;DR — one-shot setup scripts
Fill the table below, then run the matching script on each node. The scripts bundle every step in this guide — packages, Docker, RoCE/DAC checks, vLLM cluster image, model download + DAC rsync, LiteLLM + Postgres + Prisma, Open WebUI, n8n, Hermes, Tailscale, and host firewall — into a single idempotent run per node. spark-01 must complete through the image-copy/rsync stage before spark-02 can finish; the spark-02 script will pause and wait for the model weights to arrive.
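The pause-and-wait behaviour described above follows a common retry pattern. A minimal sketch of that pattern is shown here; the helper name wait_for is hypothetical and is not taken from the scripts, which inline their own loops.

```bash
# wait_for TRIES INTERVAL CMD...
# Re-run CMD until it succeeds or TRIES attempts have been used.
# Returns 0 on the first success, 1 if every attempt failed.
wait_for() {
  local tries=$1 interval=$2 n=0
  shift 2
  while [ "$n" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    n=$((n + 1))
    sleep "$interval"
  done
  return 1
}
```

With this helper, a five-minute wait for a file could be written as `wait_for 60 5 test -e /path/to/marker`, where the path is a placeholder for whatever the other node is expected to produce.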
sessionStorage: they vanish when you close the tab and are never written to disk. Generate fresh master keys with openssl rand -hex 24.
- Open Access Controls and ensure the tag is declared in tagOwners: "tag:owner": ["you@example.com"] for spark-01, "tag:client-ai": ["autogroup:admin"] for spark-02. The tag must exist before any device tries to advertise it.
- Open Settings → Keys → Generate auth key. Toggle Tags on and select the matching tag from the list. Recommended: Reusable: no, Ephemeral: no, Tags: tag:owner (or tag:client-ai).
--advertise-tags at tailscale up time. Tagged devices have key expiry automatically disabled, so the server won't drop off the tailnet on the 90-day timer.
spark-01 — your node · secrets & choices
spark-02 — client node · secrets & choices
Model + recipe (shared)
Optional layers — uncheck to skip
Core stack — vLLM cluster, LiteLLM, Postgres, Docker, RoCE check — is always installed. Tailscale ships pre-installed on DGX Spark, so uncheck it if you've already configured it (or want to use your existing config). Unchecking a layer also removes its ufw port rule.
spark-01 layers
spark-02 layers
Run order — start both scripts in parallel; the dependencies are baked in as wait loops on each side:
- Start setup-spark-02.sh first (or simultaneously). spark-01 cannot push the vLLM image until Docker is installed on spark-02; the spark-01 script will wait for it.
- Then start setup-spark-01.sh on spark-01. It installs packages, waits up to 5 min for spark-02's Docker to come online over the DAC, builds and pushes the ~19 GB vLLM image to spark-02, downloads the model on spark-01, rsyncs the ~60 GB of weights to spark-02 over the DAC, and brings up the rest of your stack.
- Meanwhile, setup-spark-02.sh waits up to 10 min for the image, then up to 30 min for the weights, then brings up the client LiteLLM, Open WebUI, n8n, Tailscale, and ufw on its side.
- When both scripts return, run hermes setup interactively on spark-01. The wizard prompts for the Telegram bot token and user ID, so it can't be safely scripted.
- Apply the Tailscale ACLs from Step 06 in each owner's admin console.
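To put the wait caps above in context, the arithmetic below compares the line-rate lower bound for the two transfers with the throughput implied by the time estimates in the script comments (1 to 2 min for the image, 4 to 8 min for the weights). It is a rough sketch that assumes decimal gigabytes and ignores protocol overhead; the helper names are hypothetical.

```bash
# Seconds needed to move SIZE_GB gigabytes at LINK_GBPS gigabits per second.
line_rate_seconds() {
  awk -v gb="$1" -v gbps="$2" 'BEGIN { printf "%.2f\n", gb * 8 / gbps }'
}

# Average MB/s implied by moving SIZE_GB gigabytes in MINUTES minutes.
implied_mb_per_s() {
  awk -v gb="$1" -v min="$2" 'BEGIN { printf "%.0f\n", gb * 1000 / (min * 60) }'
}

line_rate_seconds 19 200   # image:   0.76 s at full line rate
line_rate_seconds 60 200   # weights: 2.40 s at full line rate
implied_mb_per_s 60 4      # weights in 4 min -> 250 MB/s
implied_mb_per_s 60 8      # weights in 8 min -> 125 MB/s
```

The gap between seconds at line rate and minutes in practice suggests that the estimates are bounded by something other than the 200 Gb/s link, plausibly storage or the SSH transport, although that is not measured here.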
#!/usr/bin/env bash
# DGX AI Stack — spark-01 (owner / Ray head) one-shot setup
# Bundles: bootstrap · GUI disable · RoCE check · vLLM image build + DAC copy ·
# model download + rsync · LiteLLM + Postgres + Prisma · Open WebUI · n8n ·
# Hermes install · Hermes dashboard · Tailscale · ufw.
# Re-running is safe: every step is guarded.
set -euo pipefail
# ───── worksheet-substituted config ─────
NODE1_MGMT_IP="YOUR_NODE1_MGMT_IP"
NODE1_DAC_IP="YOUR_NODE1_DAC_IP"
NODE2_MGMT_IP="YOUR_NODE2_MGMT_IP"
NODE2_DAC_IP="YOUR_NODE2_DAC_IP"
USERNAME="YOUR_USERNAME"
TAILNET_HOSTNAME="YOUR_TAILNET_HOSTNAME"
HF_TOKEN="YOUR_HF_TOKEN"
MASTER_KEY="YOUR_MASTER_KEY"
TS_AUTHKEY="YOUR_NODE1_TS_AUTHKEY"
MODEL_RECIPE="MODEL_RECIPE"
SERVED_MODEL_NAME="SERVED_MODEL_NAME"
HF_MODEL_REPO="HF_MODEL_REPO"
HF_MODEL_DIR="HF_MODEL_DIR"
log() { echo -e "\n\033[1;34m▶ $*\033[0m"; }
die() { echo "✗ $*" >&2; exit 1; }
[[ "$EUID" -ne 0 ]] || die "do not run as root — run as $USERNAME"
[[ "$(whoami)" == "$USERNAME" ]] || die "expected user $USERNAME, got $(whoami)"
# Fail-fast if critical worksheet placeholders weren't filled.
[[ "$MASTER_KEY" != "YOUR_MASTER_KEY" && -n "$MASTER_KEY" ]] || die "MASTER_KEY not set — fill the TL;DR secrets block in the guide"
[[ "$HF_TOKEN" != "YOUR_HF_TOKEN" && -n "$HF_TOKEN" ]] || die "HF_TOKEN not set — fill the TL;DR secrets block in the guide"
[[ "$NODE1_DAC_IP" != "YOUR_NODE1_DAC_IP" && "$NODE2_DAC_IP" != "YOUR_NODE2_DAC_IP" ]] || die "fill the DAC IPs in the network worksheet"
[[ "$USERNAME" != "YOUR_USERNAME" ]] || die "fill USERNAME in the network worksheet"
# ───── 1. base packages + docker ─────
log "apt — base packages"
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release git rsync ufw \
nload jq python3-pip python3-venv openssh-client openssl
if ! command -v docker >/dev/null 2>&1; then
curl -fsSL https://get.docker.com | sudo sh
fi
sudo systemctl enable --now docker
id -nG "$USERNAME" | grep -qw docker || sudo usermod -aG docker "$USERNAME"
# Re-exec under the docker group if this shell doesn't have it yet
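if ! id -nG | grep -qw docker; then
# When piped in (curl ... | bash), $0 is not a file and cannot be re-executed.
[[ -f "$0" ]] || die "docker group not active yet and this script is not a file; save it to disk and re-run"
log "re-execing under docker group"
SCRIPT="$(readlink -f "$0")"
exec sg docker -c "bash '$SCRIPT' $*"
fi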
# ───── 2. /etc/hosts — anchor hostnames to mgmt IPs ─────
log "/etc/hosts entries"
sudo sed -i '/[[:space:]]spark-01$/d;/[[:space:]]spark-02$/d' /etc/hosts
echo "$NODE1_MGMT_IP spark-01" | sudo tee -a /etc/hosts >/dev/null
echo "$NODE2_MGMT_IP spark-02" | sudo tee -a /etc/hosts >/dev/null
# ───── 3. headless: stop and disable GDM ─────
log "disable desktop"
sudo systemctl set-default multi-user.target
sudo systemctl stop gdm 2>/dev/null || true
sudo systemctl stop gnome-remote-desktop 2>/dev/null || true
sudo systemctl disable gnome-remote-desktop 2>/dev/null || true
# ───── 4. RoCE / DAC sanity ─────
log "RoCE check"
ibdev2netdev || die "ibdev2netdev failed — RoCE drivers missing"
ls /dev/infiniband/ >/dev/null || die "/dev/infiniband missing — fix RoCE plumbing first"
# ───── 5. passwordless SSH to spark-02 ─────
log "SSH access to spark-02 (mgmt + DAC)"
[[ -f "$HOME/.ssh/id_ed25519" ]] || ssh-keygen -t ed25519 -N "" -f "$HOME/.ssh/id_ed25519"
# Probe with BatchMode — succeeds silently if a key is already authorized.
probe_ssh() {
ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
"$USERNAME@$1" 'exit 0' >/dev/null 2>&1
}
# Deploy the key to one target, but only if probing fails.
# Reads password from /dev/tty so this works even when stdin is a pipe
# (e.g. `curl ... | bash`). Falls through with a warning if the user
# skips or if password auth is disabled on spark-02.
deploy_key() {
local host="$1" label="$2"
if probe_ssh "$host"; then
echo " ✓ $label ($host) — key already authorized, no password needed"
return 0
fi
echo
echo " → $label ($host) — SSH key not yet authorized."
echo " ssh-copy-id will prompt for $USERNAME's password on spark-02."
echo " Press Ctrl-D to skip if password auth is disabled there (you will then"
echo " need to deploy ~/.ssh/id_ed25519.pub to spark-02 manually before re-running)."
if [[ -r /dev/tty ]]; then
ssh-copy-id -o StrictHostKeyChecking=accept-new "$USERNAME@$host" </dev/tty \
|| echo " (ssh-copy-id failed or was skipped for $label)"
else
echo " (no controlling terminal — cannot prompt for password; deploy key manually)"
fi
if probe_ssh "$host"; then
echo " ✓ $label ($host) — key now authorized"
else
echo " ✗ $label ($host) — still no passwordless SSH (will retry in step 6)"
fi
}
deploy_key "$NODE2_MGMT_IP" "spark-02 mgmt"
deploy_key "$NODE2_DAC_IP" "spark-02 DAC"
# ───── 6. wait for spark-02 to have Docker (build-and-copy.sh needs `docker load` there) ─────
log "waiting for spark-02 Docker over DAC ($NODE2_DAC_IP)"
SSH2="ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new $USERNAME@$NODE2_DAC_IP"
for i in {1..60}; do # 5 min cap
$SSH2 'command -v docker' >/dev/null 2>&1 && break
[[ $((i % 6)) -eq 0 ]] && echo " …spark-02 not ready (${i}/60) — start setup-spark-02.sh there if you haven't"
sleep 5
done
$SSH2 'command -v docker' >/dev/null 2>&1 || die "spark-02 unreachable or Docker missing — run setup-spark-02.sh on the other node first"
# ───── 7. vLLM image — clone + build + copy to spark-02 over DAC ─────
log "spark-vllm-docker"
cd "$HOME"
[[ -d spark-vllm-docker ]] || git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5 --copy-to "$NODE2_DAC_IP"
# ───── 8. autodiscovery → .env (accept defaults) ─────
log "discovery"
# Feed blank answers to all prompts; trailing `|| true` survives SIGPIPE under pipefail.
{ printf '\n%.0s' {1..20} | ./run-recipe.sh --discover; } || true
grep -q '^CONTAINER_HF_TOKEN=' .env || echo "CONTAINER_HF_TOKEN=$HF_TOKEN" >> .env
# ───── 9. model weights — download then rsync to spark-02 over DAC ─────
log "huggingface download"
pip3 install --break-system-packages 'huggingface_hub[cli]' >/dev/null
export PATH="$HOME/.local/bin:$PATH" # so `hf` resolves on this run
HF_TOKEN="$HF_TOKEN" hf download "$HF_MODEL_REPO" \
--local-dir "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR/"
log "rsync model → spark-02 over DAC ($NODE1_DAC_IP → $NODE2_DAC_IP)"
$SSH2 "mkdir -p ~/.cache/huggingface/hub/$HF_MODEL_DIR"
rsync -avP -e "ssh -b $NODE1_DAC_IP" \
"$HOME/.cache/huggingface/hub/$HF_MODEL_DIR/" \
"$NODE2_DAC_IP:.cache/huggingface/hub/$HF_MODEL_DIR/"
# ───── 10. LiteLLM + Prisma ─────
log "pip — litellm[proxy] + prisma"
pip3 install --break-system-packages 'litellm[proxy]' prisma >/dev/null
grep -q 'HOME/.local/bin' "$HOME/.bashrc" || \
echo 'export PATH="$HOME/.local/bin:$PATH"' >> "$HOME/.bashrc"
# ───── 11. spark-ai-stack — postgres compose ─────
log "spark-ai-stack directory + postgres"
mkdir -p "$HOME/spark-ai-stack/logs" "$HOME/workspace"
cd "$HOME/spark-ai-stack"
cat > docker-compose.yml <<'YAML'
services:
  litellm-db:
    image: postgres:16
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    volumes:
      - litellm_db:/var/lib/postgresql/data
    ports:
      # Loopback only: Docker-published ports bypass ufw, and LiteLLM
      # connects to the database via localhost.
      - "127.0.0.1:5432:5432"
volumes:
  litellm_db:
YAML
docker compose up -d litellm-db
for i in {1..40}; do
docker exec litellm-db pg_isready -U litellm >/dev/null 2>&1 && break
sleep 2
done
docker exec litellm-db pg_isready -U litellm >/dev/null 2>&1 \
|| die "postgres (litellm-db) never became ready — check 'docker logs litellm-db'"
# ───── 12. litellm-config.yaml (owner) ─────
log "litellm-config.yaml"
cat > "$HOME/spark-ai-stack/litellm-config.yaml" <<EOF
model_list:
  - model_name: Qwen3.5-122B-Non-Reasoning
    litellm_params:
      model: openai/$SERVED_MODEL_NAME
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
      extra_body:
        chat_template_kwargs:
          enable_thinking: false
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 8192
  - model_name: Qwen3.5-122B-Reasoning
    litellm_params:
      model: openai/$SERVED_MODEL_NAME
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 32768
      extra_body:
        chat_template_kwargs:
          enable_thinking: true
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 32768
litellm_settings:
  verbose: true
  store_model_in_db: true
  default_system_message: "You are a highly capable AI assistant. Be direct, accurate, and concise. Answer immediately without preamble. For coding: produce complete, working code. State definitive answers first then explain."
router_settings:
  num_retries: 0
  timeout: 600
general_settings:
  master_key: $MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
mcp_settings:
  allow_all_keys: true
EOF
log "prisma db push"
PRISMA_SCHEMA="$(python3 -c 'import litellm,os;print(os.path.join(os.path.dirname(litellm.__file__),"proxy","schema.prisma"))')"
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
"$HOME/.local/bin/prisma" db push --schema "$PRISMA_SCHEMA"
# ───── 13. systemd — vllm-cluster + litellm ─────
log "systemd units"
sudo tee /etc/systemd/system/vllm-cluster.service >/dev/null <<EOF
[Unit]
Description=vLLM Cluster - $SERVED_MODEL_NAME
After=network-online.target docker.service
Wants=network-online.target
Requires=docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
User=$USERNAME
WorkingDirectory=/home/$USERNAME/spark-vllm-docker
ExecStartPre=/bin/sleep 30
ExecStart=/home/$USERNAME/spark-vllm-docker/run-recipe.sh $MODEL_RECIPE -d -- --served-model-name $SERVED_MODEL_NAME --gpu-memory-utilization 0.80
TimeoutStartSec=600
TimeoutStopSec=60
[Install]
WantedBy=multi-user.target
EOF
sudo tee /etc/systemd/system/litellm.service >/dev/null <<EOF
[Unit]
Description=LiteLLM Proxy (owner)
After=network.target docker.service
Wants=docker.service
[Service]
Type=simple
User=$USERNAME
WorkingDirectory=/home/$USERNAME/spark-ai-stack
ExecStart=/home/$USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm-cluster litellm
# --no-block: oneshot units block on ExecStart; model load takes minutes.
sudo systemctl start --no-block vllm-cluster
sleep 10
sudo systemctl start litellm
# >>>OPT:openwebui
# ───── 14. Open WebUI ─────
log "open-webui container"
docker rm -f open-webui 2>/dev/null || true
docker run -d --name open-webui --restart unless-stopped \
-p 8080:8080 \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
-e OPENAI_API_KEY="$MASTER_KEY" \
-e WEBUI_AUTH=True \
-e ENABLE_OLLAMA_API=False \
--add-host=host.docker.internal:host-gateway \
ghcr.io/open-webui/open-webui:main
# <<<OPT:openwebui
# >>>OPT:n8n
# ───── 15. n8n compose ─────
log "n8n compose"
# Fall back to plain 'spark-01' if the worksheet hostname is blank or unfilled.
WEBHOOK_HOST="$TAILNET_HOSTNAME"
[[ -z "$WEBHOOK_HOST" || "$WEBHOOK_HOST" == "YOUR_TAILNET_HOSTNAME" ]] && WEBHOOK_HOST=spark-01
cat > "$HOME/spark-ai-stack/n8n.yml" <<EOF
services:
  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=0.0.0.0
      - N8N_PORT=5678
      - N8N_PROTOCOL=http
      - WEBHOOK_URL=http://$WEBHOOK_HOST:5678/
      - N8N_SECURE_COOKIE=false
      - NODE_ENV=production
      - GENERIC_TIMEZONE=America/Los_Angeles
    volumes:
      - n8n_data:/home/node/.n8n
    extra_hosts:
      - "host.docker.internal:host-gateway"
volumes:
  n8n_data:
EOF
docker compose -f "$HOME/spark-ai-stack/n8n.yml" up -d
# <<<OPT:n8n
# >>>OPT:hermes
# ───── 16. Hermes Agent + Dashboard ─────
log "hermes install (run 'hermes setup' interactively after this script)"
sudo apt install -y ripgrep ffmpeg
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash || true
# Pre-create state dirs.
mkdir -p "$HOME/.hermes" "$HOME/workspace"
sudo tee /etc/systemd/system/hermes-dashboard.service >/dev/null <<EOF
[Unit]
Description=Hermes Agent Dashboard
After=network.target hermes-gateway.service
Wants=hermes-gateway.service
[Service]
Type=simple
User=$USERNAME
ExecStart=/home/$USERNAME/.local/bin/hermes dashboard --port 9119 --host 0.0.0.0 --insecure --no-open
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable hermes-dashboard
sudo systemctl start hermes-dashboard
# <<<OPT:hermes
# >>>OPT:tailscale
# ───── 17. Tailscale (owner tailnet) ─────
# Prereqs in the OWNER tailnet:
# 1. tag:owner must be declared in tagOwners in the policy file.
# 2. The auth key must be generated WITH tag:owner selected (Admin → Settings → Keys).
# DGX Spark may ship with Tailscale pre-installed and possibly pre-joined; install.sh
# is idempotent and upgrades in place, and `tailscale logout` ensures a clean re-join.
log "tailscale install + up"
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale logout 2>/dev/null || true
if [[ -n "$TS_AUTHKEY" && "$TS_AUTHKEY" != "YOUR_NODE1_TS_AUTHKEY" ]]; then
sudo tailscale up --auth-key="$TS_AUTHKEY" --hostname=spark-01 --advertise-tags=tag:owner
else
echo " (TS_AUTHKEY blank — run: sudo tailscale up --hostname=spark-01 --advertise-tags=tag:owner)"
fi
# <<<OPT:tailscale
# >>>OPT:ufw
# ───── 18. host firewall ─────
log "ufw"
sudo ufw --force default deny incoming
sudo ufw default allow outgoing
# >>>OPT:tailscale
sudo ufw allow in on tailscale0 to any port 8001 proto tcp
# >>>OPT:openwebui
sudo ufw allow in on tailscale0 to any port 8080 proto tcp
# <<<OPT:openwebui
# >>>OPT:n8n
sudo ufw allow in on tailscale0 to any port 5678 proto tcp
# <<<OPT:n8n
# >>>OPT:hermes
sudo ufw allow in on tailscale0 to any port 9119 proto tcp
# <<
#!/usr/bin/env bash
# DGX AI Stack — spark-02 (client / Ray worker) one-shot setup
# Bundles: bootstrap · GUI disable · RoCE check · wait-for-image · wait-for-weights ·
# client LiteLLM + Postgres + Prisma · client Open WebUI · client n8n ·
# Tailscale (client tailnet) · ufw.
# Re-running is safe: every step is guarded.
set -euo pipefail
# ───── worksheet-substituted config ─────
NODE1_MGMT_IP="YOUR_NODE1_MGMT_IP"
NODE1_DAC_IP="YOUR_NODE1_DAC_IP"
NODE2_MGMT_IP="YOUR_NODE2_MGMT_IP"
NODE2_DAC_IP="YOUR_NODE2_DAC_IP"
USERNAME="YOUR_USERNAME"
CLIENT_TAILNET_HOSTNAME="CLIENT_TAILNET_HOSTNAME"
CLIENT_MASTER_KEY="YOUR_CLIENT_MASTER_KEY"
TS_AUTHKEY="YOUR_NODE2_TS_AUTHKEY"
SERVED_MODEL_NAME="SERVED_MODEL_NAME"
HF_MODEL_DIR="HF_MODEL_DIR"
log() { echo -e "\n\033[1;35m▶ $*\033[0m"; }
die() { echo "✗ $*" >&2; exit 1; }
[[ "$EUID" -ne 0 ]] || die "do not run as root — run as $USERNAME"
[[ "$(whoami)" == "$USERNAME" ]] || die "expected user $USERNAME, got $(whoami)"
# Fail-fast if critical worksheet placeholders weren't filled.
[[ "$CLIENT_MASTER_KEY" != "YOUR_CLIENT_MASTER_KEY" && -n "$CLIENT_MASTER_KEY" ]] || die "CLIENT_MASTER_KEY not set — fill the TL;DR secrets block in the guide"
[[ "$NODE1_DAC_IP" != "YOUR_NODE1_DAC_IP" && "$NODE2_DAC_IP" != "YOUR_NODE2_DAC_IP" ]] || die "fill the DAC IPs in the network worksheet"
[[ "$USERNAME" != "YOUR_USERNAME" ]] || die "fill USERNAME in the network worksheet"
# ───── 1. base packages + docker ─────
log "apt — base packages"
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release git rsync ufw \
nload jq python3-pip python3-venv openssh-server openssl netcat-openbsd
if ! command -v docker >/dev/null 2>&1; then
curl -fsSL https://get.docker.com | sudo sh
fi
sudo systemctl enable --now docker ssh
id -nG "$USERNAME" | grep -qw docker || sudo usermod -aG docker "$USERNAME"
# Re-exec under the docker group if this shell doesn't have it yet
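if ! id -nG | grep -qw docker; then
# When piped in (curl ... | bash), $0 is not a file and cannot be re-executed.
[[ -f "$0" ]] || die "docker group not active yet and this script is not a file; save it to disk and re-run"
log "re-execing under docker group"
SCRIPT="$(readlink -f "$0")"
exec sg docker -c "bash '$SCRIPT' $*"
fi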
# ───── 2. /etc/hosts ─────
log "/etc/hosts"
sudo sed -i '/[[:space:]]spark-01$/d;/[[:space:]]spark-02$/d' /etc/hosts
echo "$NODE1_MGMT_IP spark-01" | sudo tee -a /etc/hosts >/dev/null
echo "$NODE2_MGMT_IP spark-02" | sudo tee -a /etc/hosts >/dev/null
# ───── 3. headless ─────
log "disable desktop"
sudo systemctl set-default multi-user.target
sudo systemctl stop gdm 2>/dev/null || true
sudo systemctl stop gnome-remote-desktop 2>/dev/null || true
sudo systemctl disable gnome-remote-desktop 2>/dev/null || true
# ───── 4. RoCE / DAC sanity ─────
log "RoCE check"
ibdev2netdev || die "ibdev2netdev failed — RoCE drivers missing"
ls /dev/infiniband/ >/dev/null || die "/dev/infiniband missing"
# ───── 5. wait for spark-01 to push the vLLM image (~19 GB over 200 Gb/s DAC ≈ 1–2 min) ─────
log "waiting for vllm-node-tf5 image (pushed from spark-01 by build-and-copy.sh)"
for i in {1..120}; do # 120 × 5s = 10 min cap
docker images --format '{{.Repository}}:{{.Tag}}' | grep -q '^vllm-node-tf5:latest$' && break
[[ $((i % 12)) -eq 0 ]] && echo " …still waiting (${i}/120)"
sleep 5
done
docker images | grep -q vllm-node-tf5 || die "vllm-node-tf5 image never arrived from spark-01 — re-run build-and-copy.sh there"
# ───── 6. wait for model weights via rsync from spark-01 (~60 GB over DAC ≈ 4–8 min) ─────
log "waiting for HF model weights at ~/.cache/huggingface/hub/$HF_MODEL_DIR"
mkdir -p "$HOME/.cache/huggingface/hub"
for i in {1..180}; do # 180 × 10s = 30 min cap
if [[ -d "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" ]] && \
[[ -n "$(ls -A "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null)" ]]; then
# let rsync settle: same size across two samples 10s apart
sz1=$(du -sb "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null | awk '{print $1}')
sleep 10
sz2=$(du -sb "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null | awk '{print $1}')
[[ "$sz1" == "$sz2" && "$sz1" -gt 1000000000 ]] && break
fi
[[ $((i % 6)) -eq 0 ]] && echo " …still waiting (${i}/180, size so far: $(du -sh "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null | awk '{print $1}'))"
sleep 10
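done
# Mirror the image check above: stop here if the weights directory is still empty.
[[ -n "$(ls -A "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null)" ]] \
|| die "model weights never arrived from spark-01; re-run the rsync step there"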
# ───── 7. LiteLLM + Prisma ─────
log "pip — litellm[proxy] + prisma"
pip3 install --break-system-packages 'litellm[proxy]' prisma >/dev/null
grep -q 'HOME/.local/bin' "$HOME/.bashrc" || \
echo 'export PATH="$HOME/.local/bin:$PATH"' >> "$HOME/.bashrc"
export PATH="$HOME/.local/bin:$PATH"
# ───── 8. spark-ai-stack + standalone postgres ─────
log "spark-ai-stack + postgres"
mkdir -p "$HOME/spark-ai-stack/logs"
cd "$HOME/spark-ai-stack"
docker rm -f litellm-db 2>/dev/null || true
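# Publish Postgres on loopback only: Docker-published ports bypass ufw, and
# LiteLLM connects to the database via localhost.
docker run -d --name litellm-db --restart unless-stopped \
  -e POSTGRES_USER=litellm \
  -e POSTGRES_PASSWORD=litellm \
  -e POSTGRES_DB=litellm \
  -p 127.0.0.1:5432:5432 \
  -v litellm_db:/var/lib/postgresql/data \
  postgres:16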
for i in {1..40}; do
docker exec litellm-db pg_isready -U litellm >/dev/null 2>&1 && break
sleep 2
done
docker exec litellm-db pg_isready -U litellm >/dev/null 2>&1 \
|| die "postgres (litellm-db) never became ready — check 'docker logs litellm-db'"
# ───── 9. client litellm-config.yaml (points DAC → vLLM) ─────
log "client litellm-config.yaml"
cat > "$HOME/spark-ai-stack/litellm-config.yaml" <<EOF
model_list:
  - model_name: Qwen3.5-122B-Non-Reasoning
    litellm_params:
      model: openai/$SERVED_MODEL_NAME
      api_base: http://$NODE1_DAC_IP:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
      extra_body:
        chat_template_kwargs:
          enable_thinking: false
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 8192
  - model_name: Qwen3.5-122B-Reasoning
    litellm_params:
      model: openai/$SERVED_MODEL_NAME
      api_base: http://$NODE1_DAC_IP:8000/v1
      api_key: "not-needed"
      max_tokens: 32768
      extra_body:
        chat_template_kwargs:
          enable_thinking: true
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 32768
litellm_settings:
  verbose: true
  store_model_in_db: true
  log_config:
    level: INFO
    format: json
    filepath: /home/$USERNAME/spark-ai-stack/logs/litellm.log
router_settings:
  num_retries: 0
  timeout: 600
general_settings:
  master_key: $CLIENT_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
EOF
log "prisma db push"
PRISMA_SCHEMA="$(python3 -c 'import litellm,os;print(os.path.join(os.path.dirname(litellm.__file__),"proxy","schema.prisma"))')"
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
"$HOME/.local/bin/prisma" db push --schema "$PRISMA_SCHEMA"
# ───── 10. client litellm systemd ─────
log "systemd — litellm (client)"
sudo tee /etc/systemd/system/litellm.service >/dev/null <<EOF
[Unit]
Description=LiteLLM Proxy (client)
After=network.target docker.service
Wants=docker.service
[Service]
Type=simple
User=$USERNAME
WorkingDirectory=/home/$USERNAME/spark-ai-stack
ExecStart=/home/$USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now litellm
# >>>OPT:openwebui
# ───── 11. client Open WebUI ─────
log "client open-webui"
docker rm -f open-webui 2>/dev/null || true
docker run -d --name open-webui --restart unless-stopped \
-p 8080:8080 \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
-e OPENAI_API_KEY="$CLIENT_MASTER_KEY" \
-e WEBUI_AUTH=True \
-e ENABLE_OLLAMA_API=False \
--add-host=host.docker.internal:host-gateway \
ghcr.io/open-webui/open-webui:main
# <<<OPT:openwebui
# >>>OPT:n8n
# ───── 12. client n8n ─────
log "client n8n"
# Fall back to plain 'spark-02' if the worksheet field is blank or unfilled.
WEBHOOK_HOST="$CLIENT_TAILNET_HOSTNAME"
[[ -z "$WEBHOOK_HOST" || "$WEBHOOK_HOST" == "CLIENT_TAILNET_HOSTNAME" ]] && WEBHOOK_HOST=spark-02
docker rm -f n8n 2>/dev/null || true
docker run -d --name n8n --restart unless-stopped \
-p 5678:5678 \
-e N8N_HOST=0.0.0.0 \
-e N8N_PORT=5678 \
-e N8N_PROTOCOL=http \
-e WEBHOOK_URL="http://$WEBHOOK_HOST:5678/" \
-e N8N_SECURE_COOKIE=false \
-e NODE_ENV=production \
-v n8n_data:/home/node/.n8n \
--add-host=host.docker.internal:host-gateway \
n8nio/n8n:latest
# <<>>OPT:tailscale
# ───── 13. Tailscale (client tailnet) ─────
# Prereqs in the CLIENT tailnet:
# 1. tag:client-ai must be declared in tagOwners in the client's policy file.
# 2. The auth key must be generated WITH tag:client-ai selected (Admin → Settings → Keys).
# DGX Spark may ship with Tailscale pre-installed and possibly pre-joined; install.sh
# is idempotent and upgrades in place, and `tailscale logout` ensures a clean re-join.
log "tailscale install + up (client tailnet)"
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale logout 2>/dev/null || true
if [[ -n "$TS_AUTHKEY" && "$TS_AUTHKEY" != "YOUR_NODE2_TS_AUTHKEY" ]]; then
sudo tailscale up --auth-key="$TS_AUTHKEY" --hostname=spark-02 --advertise-tags=tag:client-ai
else
echo " (TS_AUTHKEY blank — run: sudo tailscale up --hostname=spark-02 --advertise-tags=tag:client-ai)"
fi
# <<>>OPT:ufw
# ───── 14. host firewall ─────
log "ufw"
sudo ufw --force default deny incoming
sudo ufw default allow outgoing
# >>>OPT:tailscale
sudo ufw allow in on tailscale0 to any port 8001 proto tcp
# >>>OPT:openwebui
sudo ufw allow in on tailscale0 to any port 8080 proto tcp
# <<>>OPT:n8n
sudo ufw allow in on tailscale0 to any port 5678 proto tcp
# <<
Instead of copying the scripts with scp, you can click download, push the file to a private gist or your own host, then on each node run: curl -fsSL https://<your-host>/setup-spark-01.sh | bash. The downloaded scripts have the worksheet values baked in, so anyone who can read the URL can also read your secrets. Use a one-time gist or self-host behind auth, and rotate the master keys after the run.
Prerequisites
- Two Nvidia DGX Spark nodes (Grace CPU, GB10 GPU, arm64/aarch64), each running Ubuntu 24.04
- The nodes are referred to as spark-01 (your node) and spark-02 (client node); substitute your own hostnames
- Both parties have read and accepted the Trust model section above
- Docker installed and enabled on both nodes: sudo systemctl enable docker
- Your Linux user added to the docker group on both nodes: sudo usermod -aG docker YOUR_USERNAME && newgrp docker
- 200GbE DAC cable between the two nodes (included in the dual-node DGX Spark bundle): interface enp1s0f0np0, MTU 9216, point-to-point /30. NCCL is configured to use this interface for all tensor parallel all-reduce communication.
- Mgmt LAN reachability between both nodes (1 GbE RJ45 with default routes)
- Passwordless SSH both directions (spark-01 ↔ spark-02), required for the HF cache rsync in Step 01
- Mgmt-IP entries in /etc/hosts on both nodes so hostnames resolve to mgmt addresses, not the DAC IP (commands below)
- Replace YOUR_USERNAME with your Linux username throughout
- Replace YOUR_NODE1_MGMT_IP / YOUR_NODE2_MGMT_IP with each node's mgmt IP, and YOUR_NODE1_DAC_IP / YOUR_NODE2_DAC_IP with each node's DAC IP
Bootstrap on both nodes — /etc/hosts and docker group
By default, the hostname of each node resolves to its DAC IP (198.51.100.x), not the mgmt IP. SSH from one node to the other by hostname will fail until you anchor the hostnames to mgmt IPs explicitly.
#### Run on both spark-01 AND spark-02
# Add mgmt-IP entries for both nodes
echo "YOUR_NODE1_MGMT_IP spark-01" | sudo tee -a /etc/hosts
echo "YOUR_NODE2_MGMT_IP spark-02" | sudo tee -a /etc/hosts
# Add your user to the docker group (then re-login or use newgrp)
sudo usermod -aG docker YOUR_USERNAME
newgrp docker
# Verify SSH by hostname both directions
ssh spark-01 hostname # from spark-02
ssh spark-02 hostname # from spark-01
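Re-running the two tee -a lines above appends duplicate entries. A minimal sketch of an idempotent alternative follows. ensure_hosts_entry is a hypothetical helper name, not something the bootstrap ships, and it only checks for an exact IP-plus-hostname pair, so it will not repair an earlier line that maps the same hostname to a different address.

```shell
# ensure_hosts_entry FILE IP NAME   (hypothetical helper, for illustration)
# Appends "IP NAME" to FILE only if no existing line already starts with IP
# and lists NAME as one of its hostnames.
ensure_hosts_entry() {
  local file="$1" ip="$2" name="$3"
  if awk -v ip="$ip" -v n="$name" '
       $1 == ip { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
       END { exit(found ? 0 : 1) }' "$file" 2>/dev/null; then
    return 0            # pair already present, nothing to do
  fi
  printf '%s %s\n' "$ip" "$name" >> "$file"
}
```

For /etc/hosts itself the function needs root, for example from a root shell: ensure_hosts_entry /etc/hosts YOUR_NODE1_MGMT_IP spark-01.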
If you skip the /etc/hosts step, the rsync of the Hugging Face cache between nodes (Step 01) and any later ssh spark-0X command will silently target the DAC interface, which won't have sshd bound to it unless you've changed defaults. The symptom is a "connection refused" error or a hang.
Disable the desktop environment — both nodes
The DGX Spark ships with Ubuntu Desktop. Both nodes operate as headless servers with no monitor connected — stop and disable the display stack before running any workloads. This frees GPU memory and eliminates background display scheduling noise. The multi-user target persists across reboots. To start the GUI temporarily if needed: sudo systemctl start graphical.target
Pre-step — verify SSH access to both nodes before disabling the GUI
Run from your Mac or any external machine on the same network. Both commands must return successfully before continuing. If either fails, resolve SSH access before proceeding — once the display manager is stopped you will have no local GUI fallback.
ssh YOUR_USERNAME@YOUR_NODE1_MGMT_IP "echo spark-01 SSH OK"
ssh YOUR_USERNAME@YOUR_NODE2_MGMT_IP "echo spark-02 SSH OK"
#### Run on BOTH spark-01 and spark-02
sudo systemctl stop gdm
sudo systemctl set-default multi-user.target
sudo systemctl stop gnome-remote-desktop
sudo systemctl disable gnome-remote-desktop
# Verify — should return no output
ps aux | grep -E "Xorg|gnome" | grep -v grep
vLLM clustered — TP=2 over Ray on RoCE/RDMA
vLLM is the only clustered service. The model runs with tensor-parallel size 2: spark-01 hosts the Ray master and the vLLM head process; spark-02 hosts a Ray worker. NCCL collectives flow over RoCE/RDMA on the DAC link — not TCP sockets. The earlier hand-built vllm-spark:26.04 approach is obsolete: it ran NCCL over TCP/IP (no /dev/infiniband passthrough, no NCCL_IB_HCA), so steady-state throughput sat at 2–3 tok/s instead of the 45 tok/s the hardware is capable of. The community-maintained eugr/spark-vllm-docker stack ships pre-built SM121a (Blackwell) wheels and wires up the RoCE/infiniband passthrough correctly. Use it.
Production model
| Track | Model | Notes |
|---|---|---|
| Production (default) | Qwen/Qwen3.5-122B-A10B-FP8 | The intended daily driver: 122B / A10B MoE, official Qwen FP8. ~61 GB resident per node. Includes an MTP head (qwen3_next_mtp), but speculative decoding is disabled because it is unstable in vLLM v0.19.0 (HTTP 500 on requests with standard sampling params). Steady-state throughput ~45 tok/s at TP=2 without MTP. 262K context window. Recipe: qwen3.5-122b-fp8. |
| Bootstrap fallback | Qwen/Qwen3.6-35B-A3B-FP8 | 35B MoE / A3B activation, FP8 quantized. Useful for fast iteration on cluster wiring (Ray, NCCL, RoCE) before committing to the longer 122B load. |
| Tested but does not fit / not supported | Qwen/Qwen3-235B-A22B-FP8 · Sehyo/Qwen3.5-122B-A10B-NVFP4 | 235B FP8 ≈ 117.5 GB per node, which leaves no room for KV cache; Ray OOMs. NVFP4 is single-node only on DGX Spark today; multi-node NVFP4 fails at cluster launch. See "Other known issues". |
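The fit-or-not judgement in the table comes down to simple arithmetic. As a rough sketch, assuming one byte per parameter at FP8 and an even split of weights across tensor-parallel ranks (weights_per_node is a hypothetical helper; it ignores KV cache, activations, and any layers kept at higher precision):

```shell
# weights_per_node PARAMS_BILLIONS BYTES_PER_PARAM TP_SIZE
# Prints the approximate weight footprint per node in GB, one decimal place.
# Assumption: weights divide evenly across the TP ranks.
weights_per_node() {
  awk -v p="$1" -v b="$2" -v tp="$3" 'BEGIN { printf "%.1f\n", p * b / tp }'
}
```

With these assumptions, weights_per_node 122 1 2 prints 61.0 and weights_per_node 235 1 2 prints 117.5, which lines up with the per-node figures quoted in the table.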
Step 1a — Verify RoCE interfaces on both nodes
The DAC link presents two RoCE devices per port-twin. Only the active port (port 0 on each card) is used; the second port stays Down on a standard DAC pair.
#### Run on BOTH spark-01 and spark-02
ibdev2netdev
ls /dev/infiniband/
Expected from ibdev2netdev on each node:
rocep1s0f0 → enp1s0f0np0 (Up) ← active, port 0
roceP2p1s0f0 → enP2p1s0f0np0 (Up) ← active, port 0 twin
rocep1s0f1 → enp1s0f1np1 (Down) ← DAC only uses port 0
roceP2p1s0f1 → enP2p1s0f1np1 (Down)
Expected from ls /dev/infiniband/:
rdma_cm umad0 umad1 umad2 umad3 uverbs0 uverbs1 uverbs2 uverbs3
If /dev/infiniband is missing or either port-0 device is Down, fix the RoCE plumbing on the host before continuing. RDMA passthrough into the container can only work if the devices are present and Up. A working NCCL RDMA path gives ~45 tok/s; TCP fallback gives ~2–3 tok/s.
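The port-0 check can be scripted rather than read by eye. The sketch below assumes each ibdev2netdev line names the RoCE device and ends with a state such as (Up) or (Down); roce_ports_up is a hypothetical helper name.

```shell
# roce_ports_up DEVICE...   (hypothetical helper)
# Reads ibdev2netdev-style text on stdin. Succeeds only when every DEVICE
# given as an argument appears on a line that also contains "(Up)".
roce_ports_up() {
  local input dev
  input="$(cat)"
  for dev in "$@"; do
    # -w matches the device as a whole word, -F treats it as a literal string
    if ! printf '%s\n' "$input" | grep -wF -- "$dev" | grep -qF '(Up)'; then
      echo "not up: $dev" >&2
      return 1
    fi
  done
}
```

A possible invocation on each node: ibdev2netdev | roce_ports_up rocep1s0f0 roceP2p1s0f0 && echo "RoCE port 0 OK".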
Step 1b — Clone and build the image (spark-01)
The build pulls the pre-built SM121a (Blackwell) wheels — no compilation required — and the helper script auto-copies the resulting image to spark-02 over the DAC. The --tf5 flag is required for the container variant used by the production recipe. Total build time ~30 minutes (mostly image pull + cross-node copy).
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5 --copy-to YOUR_NODE2_DAC_IP
Verify: docker images | grep vllm-node-tf5 on both nodes should show vllm-node-tf5:latest at ~19 GB.
Step 1c — Autodiscovery (spark-01)
The discovery step detects the local and peer DAC IPs, identifies the RoCE twins for NCCL_IB_HCA, and writes the values to .env. Every subsequent run-recipe.sh invocation reads from this file — get it right once and the cluster launch becomes a one-liner.
cd ~/spark-vllm-docker
./run-recipe.sh --discover
Accept the prompts. Once the Hugging Face token has been appended (see the command after the listing), the .env file should contain:
CLUSTER_NODES=YOUR_NODE1_DAC_IP,YOUR_NODE2_DAC_IP
COPY_HOSTS=YOUR_NODE2_DAC_IP
LOCAL_IP=YOUR_NODE1_DAC_IP
ETH_IF=enp1s0f0np0
IB_IF=rocep1s0f0,roceP2p1s0f0
CONTAINER_HF_TOKEN=<your_hf_token>
Append the Hugging Face token after discovery — the discovery prompt won't ask for it:
echo "CONTAINER_HF_TOKEN=YOUR_HF_TOKEN" >> ~/spark-vllm-docker/.env
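Since every later launch reads this file, it can be worth checking it before Step 1e. A minimal sketch, assuming plain KEY=value lines with no quoting or export prefixes; env_missing_keys is a hypothetical helper name.

```shell
# env_missing_keys FILE KEY...   (hypothetical helper)
# Prints each KEY that has no non-empty "KEY=value" line in FILE.
# Returns 1 if at least one key is missing or empty, 0 otherwise.
env_missing_keys() {
  local file="$1" key missing=0
  shift
  for key in "$@"; do
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "$key"
      missing=1
    fi
  done
  return "$missing"
}
```

For example: env_missing_keys ~/spark-vllm-docker/.env CLUSTER_NODES COPY_HOSTS LOCAL_IP ETH_IF IB_IF CONTAINER_HF_TOKEN prints nothing when all six keys from the listing are set.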
Step 1d — Download the model and rsync to spark-02 over the DAC
Both nodes need the model weights resident locally. Pull on spark-01 first, then rsync to spark-02 over the DAC, which keeps the transfer (the full FP8 checkpoint, roughly 122 GB at one byte per parameter) off the mgmt LAN.
#### spark-01
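# Run as YOUR_USERNAME (not root)
# No --local-dir here: without it, hf download fills the standard hub cache
# layout (blobs/, refs/, snapshots/) under
# ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-FP8/,
# which offline loading by repo id (HF_HUB_OFFLINE=1) relies on.
HF_TOKEN=YOUR_HF_TOKEN hf download Qwen/Qwen3.5-122B-A10B-FP8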
# Bind explicitly to the local DAC IP so the rsync runs over enp1s0f0np0
rsync -avP \
-e "ssh -b YOUR_NODE1_DAC_IP" \
~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-FP8/ \
YOUR_NODE2_DAC_IP:~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-FP8/
To confirm the transfer is running over the DAC, run nload enp1s0f0np0 on spark-02 in another shell during the rsync.
Step 1e — Launch the cluster
The launch script reads .env, starts the head container on spark-01, SSHes into spark-02 to start the worker container, and forms the Ray cluster. Both containers are started from the same image, so Ray versions are guaranteed to match.
The production recipe is at ~/spark-vllm-docker/recipes/qwen3.5-122b-fp8.yaml. Key parameters baked into the recipe:
model: Qwen/Qwen3.5-122B-A10B-FP8
container: vllm-node-tf5
max_model_len: 262144
max_num_batched_tokens: 8192
mods: mods/fix-qwen3.5-chat-template
env:
  HF_HUB_OFFLINE: 1
  TRANSFORMERS_OFFLINE: 1
These vars prevent vLLM from attempting to reach huggingface.co on every container start. The model weights are already present in the local HF cache — DNS failures after a power outage or network interruption will not block startup.
Additional flags passed to vllm serve by the recipe:
--load-format fastsafetensors --enable-prefix-caching --enable-auto-tool-choice \
--tool-call-parser qwen3_coder --reasoning-parser qwen3 --chat-template unsloth.jinja \
-tp 2 --distributed-executor-backend ray \
--max-num-batched-tokens 8192 \
--default-chat-template-kwargs '{"enable_thinking": false}'
Launch with gpu_memory_utilization passed as a CLI override (the recipe default is overridden here):
cd ~/spark-vllm-docker
./run-recipe.sh qwen3.5-122b-fp8 -d -- \
--served-model-name qwen3.5-122b \
--gpu-memory-utilization 0.80
The script starts the Ray head on spark-01, then SSHes into spark-02 to start the worker. It waits until both GPUs register with the Ray cluster before starting vllm serve. Do not interrupt between head start and cluster formation — if you need to restart, stop both containers first (docker stop vllm_node on each node), then re-run the launch command on spark-01.
What run-recipe.sh does automatically:
- Sets NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 so NCCL pins to both RoCE twins.
- Passes /dev/infiniband devices into both containers (RDMA verbs + CM).
- Forms the Ray cluster (head on spark-01, worker on spark-02) and waits until both GPUs are registered before starting vllm serve.
- Applies the mods/fix-qwen3.5-chat-template mod and uses fastsafetensors for fast loader I/O.
The FP8 model includes a Multi-Token Prediction (MTP) head (qwen3_next_mtp), and vLLM supports it via --speculative-config. Do not enable it.
As of vLLM v0.19.0, MTP speculative decoding is actively unstable on Qwen3.5-class models:
- Clients sending min_p or logit_bias sampling parameters (default behavior in Open WebUI, Hermes, and most frontends) receive a hard HTTP 500 ("The min_p and logit_bias sampling parameters are not yet supported with speculative decoding"), breaking all inference across every client simultaneously.
- Tool calls fail or produce malformed output every 3–4 calls under qwen3_next_mtp. (vllm-project/vllm#35800)
- Long-sequence requests crash with illegal memory access.
- Generation quality degrades across multi-turn sessions, collapsing to 0% draft acceptance rate.
vLLM's own documentation acknowledges speculative decoding is not yet optimized for all sampling parameters. Omit --speculative-config entirely until upstream fixes these issues. Steady-state throughput without MTP is ~45 tok/s on this hardware; the benefit does not justify the breakage.
Watch the launch:
docker logs -f vllm_node
Step 1f — Verification
#### spark-01 — confirm NCCL is using RDMA, not TCP sockets
docker logs vllm_node | grep -E "NET/IB|NET/Socket"
Good: NCCL INFO NET/IB : Using [0]rocep1s0f0:1/IB [1]roceP2p1s0f0:1/IB (NCCL is on RDMA).
Bad: NCCL INFO NET/Socket : Using … (NCCL fell back to TCP; throughput will be ~2–3 tok/s). See Cluster issues.
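The good/bad distinction above can be reduced to a one-word answer for use in scripts or health checks. This is a sketch only: nccl_transport is a hypothetical helper, and it assumes that any NET/IB line in the log means the RDMA path was selected, as the guidance above implies.

```shell
# nccl_transport   (hypothetical helper)
# Reads container log text on stdin and prints one word:
#   rdma     an NCCL "NET/IB" line is present
#   tcp      no "NET/IB" line, but a "NET/Socket" line is present
#   unknown  neither appears (NCCL may not have initialised yet)
nccl_transport() {
  local log
  log="$(cat)"
  if printf '%s\n' "$log" | grep -q 'NET/IB'; then
    echo rdma
  elif printf '%s\n' "$log" | grep -q 'NET/Socket'; then
    echo tcp
  else
    echo unknown
  fi
}
```

One way to call it on spark-01: docker logs vllm_node 2>&1 | nccl_transport. Anything other than rdma is worth investigating before benchmarking.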
#### spark-01 — Ray cluster status
docker exec vllm_node ray status
Expected: 2.0/2.0 GPU, both DAC IPs listed (YOUR_NODE1_DAC_IP and YOUR_NODE2_DAC_IP).
#### spark-01 — model endpoint
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"qwen3.5-122b","messages":[{"role":"user","content":"hi"}],"max_tokens":16}'
Expected: {"data":[{"id":"qwen3.5-122b",...}]} on the first call. The first completion takes ~20s while CUDA graphs warm up; subsequent completions stream at ~45 tok/s.
#### Both nodes — GPU residency (GB10 quirk)
GB10 uses unified memory. The standard --query-gpu=memory.used,memory.total fields return [N/A] on this hardware — expected. Use plain nvidia-smi and read the Processes section:
nvidia-smi # run on each node
You should see a vllm / RayWorkerWrapper process on each node with roughly ~61 GB resident at the FP8 weight footprint — leaving headroom for KV cache at gpu_memory_utilization=0.80.
Step 1g — Systemd auto-start
Run only on spark-01 — the cluster launcher SSHes into spark-02 to bring up the worker. The 30-second ExecStartPre sleep gives spark-02 time to finish booting and have Docker running before the head SSHes in.
sudo tee /etc/systemd/system/vllm-cluster.service << 'EOF'
[Unit]
Description=vLLM Cluster - Qwen3.5-122B FP8
After=network-online.target docker.service
Wants=network-online.target
Requires=docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/spark-vllm-docker
ExecStartPre=/bin/sleep 30
ExecStart=/home/YOUR_USERNAME/spark-vllm-docker/run-recipe.sh qwen3.5-122b-fp8 -d -- \
--served-model-name qwen3.5-122b \
--gpu-memory-utilization 0.80
TimeoutStartSec=300
TimeoutStopSec=60
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm-cluster.service
Verify: systemctl status vllm-cluster on spark-01 should show active (exited) and curl http://localhost:8000/v1/models should return the model list once warmup completes.
To manage the cluster after initial setup:
# Start
sudo systemctl start vllm-cluster.service
# Stop
sudo systemctl stop vllm-cluster.service
# Restart (spark-02 worker restarts automatically via launch-cluster.sh)
sudo systemctl restart vllm-cluster.service
# Logs
sudo journalctl -u vllm-cluster.service -f
docker logs -f vllm_node
Performance results
| Metric | Value | Notes |
|---|---|---|
| Hardware | 2× NVIDIA DGX Spark (GB10, 128 GB unified memory each) | Total 256 GB unified pool |
| Model | Qwen/Qwen3.5-122B-A10B-FP8 | 122B / A10B MoE, FP8 (official Qwen) |
| Network | 200 Gb/s DAC (QSFP) · NCCL over RoCE/RDMA | NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 |
| Tensor parallel | TP=2 across both nodes | Ray-backed distributed executor |
| First request (CUDA-graph warmup) | ~20 s | One-time cost per cold start |
| Steady-state throughput | ~45 tok/s | Single-stream decode |
| Context window | 262,144 tokens | Full Qwen3.5 context retained |
| Memory per node | ~61 GB resident | FP8 weights; leaves remaining unified memory for KV cache at gpu_memory_utilization=0.80 |
| Speculative decoding | Disabled | MTP (qwen3_next_mtp) is not stable in vLLM v0.19.0 — causes HTTP 500 on requests with standard sampling params. See Step 1e warning. |
| Previous (hand-built, NCCL over TCP) | ~2–3 tok/s | Same hardware, wrong transport |
| Improvement | ~18× | From correct NCCL RoCE configuration alone |
The performance ceiling is GB10 LPDDR5X memory bandwidth (273 GB/s). With 10B active parameters at FP8 (one byte each) split across two nodes by TP=2, each node reads ≈5 GB of weights per decoded token, so the theoretical maximum is ~55 tok/s (273 / 5 ≈ 54.6). 45 tok/s is ~82% of that memory-bandwidth ceiling, so there is not much performance left on the table.
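The ceiling estimate can be reproduced with two divisions. The sketch below takes the paragraph's own figures (273 GB/s of bandwidth, about 5 GB of weight reads per decoded token on each node, 45 tok/s measured). The function names are hypothetical, and the model deliberately ignores KV-cache reads and interconnect time.

```shell
# decode_ceiling BANDWIDTH_GBPS WEIGHT_GB_PER_TOKEN   (hypothetical helper)
# Memory-bandwidth-bound tokens per second, one decimal place.
decode_ceiling() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

# percent_of MEASURED CEILING   (hypothetical helper)
# MEASURED as a whole-number percentage of CEILING.
percent_of() {
  awk -v m="$1" -v c="$2" 'BEGIN { printf "%.0f\n", 100 * m / c }'
}
```

decode_ceiling 273 5 prints 54.6, and percent_of 45 54.6 prints 82, matching the ~55 tok/s and ~82% quoted above.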
Bootstrap fallback — the 35B FP8 model
The 35B FP8 model is still useful for fast iteration on cluster wiring (Ray, NCCL, RoCE) before committing to the longer 122B load. The eugr stack ships a recipe for it; if you've populated the cache, fall back with:
cd ~/spark-vllm-docker
./run-recipe.sh qwen3.6-35b-fp8 -d -- \
--served-model-name qwen3.6-35b \
--default-chat-template-kwargs '{"enable_thinking": false}'
If you fall back, update the backend model value (model: openai/...) in the LiteLLM config in Step 02 (and the client's LiteLLM in Step 03) so that it matches the new --served-model-name.
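One way to apply that rename to a LiteLLM config is a small sed wrapper. This is a sketch under stated assumptions: set_backend_model is a hypothetical helper, it assumes the backend lines have the form model: openai/NAME as in this guide's configs, and NAME must not contain the characters | or &. The model_name display entries are left untouched.

```shell
# set_backend_model FILE NEW_NAME   (hypothetical helper)
# Rewrites every "model: openai/<anything>" value in FILE to openai/NEW_NAME.
# Lines such as "model_name:" and "model_info:" do not match and are kept.
set_backend_model() {
  local file="$1" name="$2" tmp rc=0
  tmp="$(mktemp)" || return 1
  sed -E "s|^([[:space:]]*model:[[:space:]]*openai/).*|\1${name}|" "$file" > "$tmp" \
    && cat "$tmp" > "$file" || rc=1   # cat keeps the original file's mode and owner
  rm -f "$tmp"
  return "$rc"
}
```

For example, set_backend_model ~/spark-ai-stack/litellm-config.yaml qwen3.6-35b followed by sudo systemctl restart litellm.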
Your LiteLLM proxy on spark-01
This is your LiteLLM proxy — your master key, your SQLite log corpus, your routing rules. It serves only your application stack on spark-01 (your Open WebUI, your Hermes, your n8n). The client gets their own separate LiteLLM in Step 03.
Your LiteLLM lives on the same node as the vLLM head and points at localhost:8000. The clustered vLLM presents one logical OpenAI-compatible endpoint — LiteLLM doesn't need to know there are two physical nodes behind it.
#### spark-01 — directories and install
mkdir -p ~/spark-ai-stack/logs
cd ~/spark-ai-stack
pip3 install 'litellm[proxy]' --break-system-packages
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
~/.local/bin/litellm --version
#### spark-01 — config (single backend pointing at the local clustered vLLM)
cat > ~/spark-ai-stack/litellm-config.yaml << 'EOF'
model_list:
  # Default: thinking off. Used by all clients unless they explicitly select the reasoning model.
  - model_name: Qwen3.5-122B-Non-Reasoning
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
      extra_body:
        chat_template_kwargs:
          enable_thinking: false
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 8192
  # Opt-in reasoning: user selects this model explicitly for complex tasks.
  - model_name: Qwen3.5-122B-Reasoning
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 32768
      extra_body:
        chat_template_kwargs:
          enable_thinking: true
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 32768
litellm_settings:
  verbose: true
  store_model_in_db: true
  default_system_message: "You are a highly capable AI assistant. Be direct,
    accurate, and concise. Answer immediately without preamble. Never deliberate
    out loud about whether or how to answer. For coding: produce complete,
    working code. State definitive answers first then explain."
router_settings:
  num_retries: 0
  timeout: 600
general_settings:
  master_key: YOUR_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
mcp_settings:
  allow_all_keys: true
EOF
Thinking is off by default: the vLLM head sets it through --default-chat-template-kwargs, and LiteLLM reinforces this per-entry via extra_body. Users select Qwen3.5-122B-Reasoning in Open WebUI, n8n, or any frontend when they need extended reasoning. All other requests use Qwen3.5-122B-Non-Reasoning.
#### spark-01 — systemd service
sudo tee /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy
After=network.target docker.service
Wants=docker.service
[Service]
Type=simple
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/spark-ai-stack
ExecStart=/home/YOUR_USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm
Verify: curl http://localhost:8001/v1/models (no key needed if you didn't set master_key yet, otherwise use -H "Authorization: Bearer YOUR_MASTER_KEY").
Do not expose this LiteLLM to spark-02. If you are using Tailscale ACLs (Step 06), only your tailnet should reach spark-01:8001. The client uses their own LiteLLM (Step 03); they never call yours.
Step 2d — PostgreSQL for the Admin UI and virtual keys (required)
The LiteLLM Admin UI and virtual key system require PostgreSQL — this is why the step exists. LiteLLM's Prisma schema is hardcoded for PostgreSQL, and the Admin UI, virtual key generation, and model management all write to the PostgreSQL metadata database.
#### spark-01 — bring up the litellm-db postgres container
On spark-01 the postgres container lives in ~/spark-ai-stack/docker-compose.yml alongside n8n:
services:
  litellm-db:
    image: postgres:16
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    volumes:
      - litellm_db:/var/lib/postgresql/data
    ports:
      - "5432:5432"
volumes:
  litellm_db:
cd ~/spark-ai-stack
docker compose up -d litellm-db
docker compose ps litellm-db
#### spark-01 — apply the Prisma schema
Install the Prisma CLI if missing, then push the LiteLLM Prisma schema into the new database. This must run once on spark-01 before LiteLLM starts, otherwise the UI will return table public.LiteLLM_UserTable does not exist.
pip install prisma --break-system-packages
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
prisma db push \
--schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma
Expected output: Your database is now in sync with your Prisma schema. Done in <Ns>
#### spark-01 — wire the database into litellm-config.yaml
Add the database_url to general_settings (not at the top level — see troubleshooting) and enable model-in-DB storage so the UI can edit the model list:
general_settings:
  master_key: YOUR_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
litellm_settings:
  store_model_in_db: true
Restart and verify:
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
curl -s http://localhost:8001/health/readiness | head
Open the Admin UI at http://spark-01:8001/ui and log in with admin and your master key. Generate per-service virtual keys from the Virtual Keys tab (one for Open WebUI, one for n8n, one for Hermes; never paste the master key into a downstream service).
Client LiteLLM proxy on spark-02
The client gets their own LiteLLM proxy on spark-02, with their own master key, their own log corpus, and their own routing rules. It points at the shared vLLM endpoint over the DAC link. This is not a copy of spark-01's LiteLLM — it has no shared config, no shared key, no shared logs. The client controls their own master key and never shares it with you.
If you are setting up spark-02 on behalf of the client, hand off the master-key generation step (or have them rotate the key the moment they take over). The point of split trust is that you do not hold the client's API credentials.
#### spark-02 — install
mkdir -p ~/spark-ai-stack/logs
cd ~/spark-ai-stack
pip3 install 'litellm[proxy]' --break-system-packages
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
#### spark-02 — generate the client master key
Run on the client's terminal — store this key only on spark-02:
echo "sk-client-$(openssl rand -hex 16)"
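Rather than leaving the key in terminal scrollback, it can be written straight to an owner-only file. The sketch below uses the ~/.spark02-litellm-key path that this guide names for the client key; store_key_file itself is a hypothetical helper, and the mode check in the usage note assumes GNU stat as shipped with Ubuntu.

```shell
# store_key_file PATH KEY   (hypothetical helper)
# Writes KEY to PATH so that only the owning user can read it (mode 0600).
store_key_file() {
  local path="$1" key="$2"
  # The subshell keeps the restrictive umask from leaking into the caller.
  ( umask 077 && printf '%s\n' "$key" > "$path" ) || return 1
  chmod 600 "$path"   # also tightens a file that already existed
}
```

A possible invocation on spark-02: store_key_file ~/.spark02-litellm-key "sk-client-$(openssl rand -hex 16)", then stat -c %a ~/.spark02-litellm-key to confirm the mode is 600.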
#### spark-02 — config (points at vLLM over the DAC)
cat > ~/spark-ai-stack/litellm-config.yaml << 'EOF'
model_list:
  - model_name: Qwen3.5-122B-Non-Reasoning
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://YOUR_NODE1_DAC_IP:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
      extra_body:
        chat_template_kwargs:
          enable_thinking: false
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 8192
  - model_name: Qwen3.5-122B-Reasoning
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://YOUR_NODE1_DAC_IP:8000/v1
      api_key: "not-needed"
      max_tokens: 32768
      extra_body:
        chat_template_kwargs:
          enable_thinking: true
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 32768
litellm_settings:
  verbose: true
  database:
    type: sqlite
    path: /home/YOUR_USERNAME/spark-ai-stack/logs/litellm.db
  log_config:
    level: INFO
    format: json
    filepath: /home/YOUR_USERNAME/spark-ai-stack/logs/litellm.log
general_settings:
  master_key: YOUR_CLIENT_MASTER_KEY # set to the sk-client-... value above
router_settings:
  num_retries: 0
  timeout: 600
EOF
Keep api_base on the DAC IP. Do not point it at YOUR_NODE1_MGMT_IP:8000 unless the DAC is down.
#### spark-02 — systemd service (independent of spark-01)
sudo tee /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy (client)
After=network.target docker.service
Wants=docker.service
[Service]
Type=simple
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/spark-ai-stack
ExecStart=/home/YOUR_USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm
sudo systemctl status litellm --no-pager
Verification — confirm the request hits spark-01:8000
#### spark-02 — local LiteLLM responds with the client key
curl http://localhost:8001/v1/models \
-H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY"
#### spark-02 — inference round-trip through the shared backend
curl http://localhost:8001/v1/chat/completions \
-H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" \
-H 'Content-Type: application/json' \
-d '{"model":"Qwen3.5-122B-Non-Reasoning","messages":[{"role":"user","content":"hi from client"}],"max_tokens":16}'
On spark-01, tail -f ~/spark-ai-stack/logs/litellm.log shows nothing because your LiteLLM is not in the path. Instead, run docker logs --tail 20 vllm_node on spark-01; you should see the new request reach the vLLM head.
Step 3d — PostgreSQL for the client Admin UI and virtual keys (required)
The LiteLLM Admin UI and virtual key system require PostgreSQL — same requirement as Step 02. Set up PostgreSQL on spark-02 so the client's Admin UI, virtual key generation, and model management work correctly. spark-02 doesn't run a docker-compose stack, so we use a standalone postgres container.
#### spark-02 — standalone postgres container
docker run -d --name litellm-db --restart unless-stopped \
-e POSTGRES_USER=litellm \
-e POSTGRES_PASSWORD=litellm \
-e POSTGRES_DB=litellm \
-p 5432:5432 \
-v litellm_db:/var/lib/postgresql/data \
postgres:16
#### spark-02 — apply the Prisma schema
pip install prisma --break-system-packages
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
prisma db push \
--schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma
Expected output: Your database is now in sync with your Prisma schema.
#### spark-02 — wire the database into the client litellm-config.yaml
database_url goes under general_settings, alongside the existing master_key. Add store_model_in_db: true under litellm_settings:
general_settings:
  master_key: YOUR_CLIENT_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
litellm_settings:
  store_model_in_db: true
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
Open the Admin UI at http://spark-02:8001/ui and log in with admin and the client master key. The client generates their own per-app virtual keys from the Virtual Keys tab; you never see them.
If a virtual key was created before the schema was (re)applied (prisma db push), the old key will appear in the UI but lookups will fail with Virtual key not found in LiteLLM_VerificationTokenTable. Delete it in the UI, restart LiteLLM, then generate a new one.
Your Open WebUI on spark-01
Your daily-driver chat interface, owned by you, on your node. It points at your LiteLLM at http://localhost:8001/v1. The client gets their own Open WebUI on spark-02 in Step 05 — neither side can see the other's chat history, knowledge bases, RAG documents, or API keys.
#### spark-01 — directory and run
mkdir -p ~/spark-ai-stack
cd ~/spark-ai-stack
docker run -d \
--name open-webui \
--restart unless-stopped \
-p 8080:8080 \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
-e OPENAI_API_KEY="YOUR_MASTER_KEY" \
-e WEBUI_AUTH=True \
-e ENABLE_OLLAMA_API=False \
--add-host=host.docker.internal:host-gateway \
ghcr.io/open-webui/open-webui:main
Visit http://localhost:8080 from spark-01 (or via your tailnet — see Step 06), create your admin account, and confirm in Settings → Connections → OpenAI API:
| Setting | Value |
|---|---|
| API Base URL | http://host.docker.internal:8001/v1 |
| API Key | your master_key |
| Default model | Qwen3.5-122B-Non-Reasoning |
| Memory | Toggle ON (Settings → Personalization) |
Client Open WebUI on spark-02
The client's daily-driver chat interface, owned by the client, on the client's node. It points at the client's LiteLLM at http://localhost:8001/v1 — which in turn calls the shared vLLM head on spark-01:8000 over the DAC.
The client's data lives on the client's node. Their Open WebUI database, knowledge bases, RAG document store, embedding indexes, conversation history, attached files, and account list — all of it is in the open-webui Docker volume on spark-02. None of it is replicated to spark-01. If you spin spark-02 down, the client's UI state goes with it; if you image spark-01, the client's state is not in your image.
#### spark-02 — directory and run
mkdir -p ~/spark-ai-stack
cd ~/spark-ai-stack
docker run -d \
--name open-webui \
--restart unless-stopped \
-p 8080:8080 \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
-e OPENAI_API_KEY="YOUR_CLIENT_MASTER_KEY" \
-e WEBUI_AUTH=True \
-e ENABLE_OLLAMA_API=False \
--add-host=host.docker.internal:host-gateway \
ghcr.io/open-webui/open-webui:main
Visit http://localhost:8080 from spark-02 (or via the client's tailnet — see Step 06), create the client's admin account, and confirm:
| Setting | Value |
|---|---|
| API Base URL | http://host.docker.internal:8001/v1 (client's LiteLLM) |
| API Key | client's master_key |
| Default model | Qwen3.5-122B-Non-Reasoning |
Recommended Open WebUI system prompt
Set this once in Settings → General → System Prompt. It pairs with --default-chat-template-kwargs '{"enable_thinking": false}' on the vLLM head (Step 01): the model answers directly by default, and users can opt into extended reasoning per-message by prefixing the prompt with /think. Apply the same prompt on both Open WebUIs (yours on spark-01 and the client's on spark-02) — they share the underlying model.
You are a highly capable AI assistant. Be direct, accurate, and concise.
Rules:
- Answer immediately without preamble or meta-commentary
- Never deliberate out loud about whether or how to answer — just answer
- Never question the framing of a hypothetical — engage with it directly
- For technical questions: be precise, use correct terminology
- For coding: produce complete, working code — no placeholders or omissions
- For reasoning: show your work clearly but efficiently — no repetition
- If a question has a definitive answer, state it first then explain
- Match response length to question complexity
Step 5b — All client services on spark-02 at a glance
Three client-facing services run on spark-02: LiteLLM (Step 03), Open WebUI (Step 05 above), and n8n. All three are reachable on the client's tailnet via tag:client-ai (see Step 06). The n8n container below is not covered elsewhere — bring it up after the client's LiteLLM is healthy:
#### spark-02 — n8n container
docker run -d --name n8n --restart unless-stopped \
-p 5678:5678 \
-e N8N_HOST=0.0.0.0 \
-e N8N_PORT=5678 \
-e N8N_PROTOCOL=http \
-e WEBHOOK_URL=http://spark-02:5678/ \
-e N8N_SECURE_COOKIE=false \
-e NODE_ENV=production \
-v n8n_data:/home/node/.n8n \
--add-host=host.docker.internal:host-gateway \
n8nio/n8n:latest
| Service | Port | How it runs | Backend / api_base | Auth secret |
|---|---|---|---|---|
| LiteLLM | 8001 | systemd (same structure as spark-01) | http://YOUR_NODE1_DAC_IP:8000/v1 (DAC link) | Client master key at ~/.spark02-litellm-key (chmod 600) |
| Open WebUI | 8080 | docker run --restart unless-stopped | OPENAI_API_BASE_URL=http://host.docker.internal:8001/v1 | Client master key (or virtual key) |
| n8n | 5678 | docker run --restart unless-stopped | WEBHOOK_URL=http://spark-02:5678/ | n8n owner account (set on first login) |
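The table keeps the client's master key in ~/.spark02-litellm-key with mode 600. The following is a minimal sketch of creating such a file so that it is never world-readable, not the guide's own procedure. The helper name write_secret, the variable name LITELLM_MASTER_KEY, and the use of openssl to generate a key are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: write a secret to a file that is 0600 from the moment it exists.
# write_secret and LITELLM_MASTER_KEY are illustrative names (assumptions).

write_secret() {     # write_secret <path> <value>
  local path="$1" value="$2"
  # umask 077 in a subshell: a newly created file gets mode 0600 immediately.
  ( umask 077; printf 'LITELLM_MASTER_KEY=%s\n' "$value" > "$path" )
  chmod 600 "$path"  # also tightens a file that already existed
}

# Only touch the real key file when explicitly asked to.
if [ "${WRITE_KEY:-0}" = "1" ]; then
  write_secret "$HOME/.spark02-litellm-key" "sk-$(openssl rand -hex 24)"
  stat -c '%a %n' "$HOME/.spark02-litellm-key"
fi
```

The KEY=value layout is chosen because that is the line format a systemd EnvironmentFile= expects, so the same file can be referenced from the LiteLLM unit without copying the secret into litellm-config.yaml.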
All three services are reachable on the client's tailnet via tag:client-ai on TCP 8001 / 8080 / 5678 (see the ACL grants in Step 06). They are not reachable from your tailnet: split-trust by construction.
Store the client's master key at ~/.spark02-litellm-key with chmod 600. Reference it from the LiteLLM systemd unit via EnvironmentFile= rather than embedding it in litellm-config.yaml, so the config file on disk does not contain the secret.
Tailscale (both nodes, separate tailnets)
Each node joins its owner's tailnet independently. Two separate tailnets, two separate ACL policies, two separate sets of users. The DAC link (198.51.100.0/30) is private physical hardware between the two nodes — it is not advertised onto either tailnet, and it is not used for any cross-tailnet routing.
spark-01:8000 has no authentication. Once Tailscale is configured, ensure your Tailscale ACLs do not expose port 8000 to the client's tailnet (and the client's ACLs do not expose your node's 8000 port to anyone either). The client must only reach spark-02:8001 (their own LiteLLM). If they can reach spark-01:8000 directly, they bypass their LiteLLM entirely and have unauthenticated inference access, which also means no key-scoped logging, no rate limit, and no audit trail.
spark-01 — your tailnet
#### spark-01 — install + join
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=spark-01 --advertise-tags=tag:owner
tailscale ip -4 # note this address — your apps will be reachable here
In your tailnet's ACL policy (Tailscale admin console), expose your Open WebUI, your LiteLLM, and your other apps only to your own users. Example ACL fragment:
{
"acls": [
{ "action": "accept",
"src": ["group:owner-users"],
"dst": ["tag:owner:8080", "tag:owner:8001", "tag:owner:5678", "tag:owner:9119"]
},
{ "action": "accept",
"src": ["group:owner-users"],
"dst": ["tag:owner:22"]
}
],
"tagOwners": {
"tag:owner": ["YOU@example.com"]
},
"groups": {
"group:owner-users": ["YOU@example.com"]
}
}
Do NOT add tag:owner:8000 to any allow rule. Port 8000 (vLLM) is unauthenticated and must remain reachable only via localhost (your LiteLLM in Step 02) and via the DAC address 198.51.100.1 (used by the client's LiteLLM in Step 03).
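A quick way to catch an accidental port 8000 entry is to scan a saved copy of the policy before pasting it into the admin console. This is a rough sketch: the helper name acl_exposes_8000 and the POLICY_FILE variable are hypothetical, and the pattern only matches explicit entries such as "tag:owner:8000" or "tcp:8000"; it does not understand wildcards or port ranges.

```shell
#!/usr/bin/env bash
# Sketch: print any line of a Tailscale policy file that names port 8000.
# acl_exposes_8000 and POLICY_FILE are illustrative names (assumptions).

acl_exposes_8000() {   # acl_exposes_8000 <policy-file>
  # Matches ':8000' followed by a closing quote, a comma, or a range dash.
  grep -nE ':8000[",-]' "$1"
}

if [ -n "${POLICY_FILE:-}" ]; then
  if acl_exposes_8000 "$POLICY_FILE"; then
    echo "LEAK: port 8000 appears in $POLICY_FILE (lines above)"
  else
    echo "OK: no explicit port 8000 entry in $POLICY_FILE"
  fi
fi
```

Run it as POLICY_FILE=policy.json bash check-acl.sh (the file and script names are placeholders). A clean result here does not replace the negative tests in the validation checklist, since a wildcard rule can still expose the port.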
spark-02 — client's tailnet
#### spark-02 — install + join (client's auth key)
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up \
--authkey=<client-auth-key> \
--hostname=spark-02 \
--advertise-tags=tag:client-ai
tailscale ip -4 # client's apps will be reachable here, only on their tailnet
The client's tailnet ACLs are theirs to author. Mirror the structure above with their own users and tags. The client should expose :8080 (their Open WebUI), :8001 (their LiteLLM, only if they want programmatic access from elsewhere), and :5678 (their n8n).
Disable key expiry for spark-02 in the client's Tailscale admin (Machines → spark-02 → ⋯ → Disable key expiry). Server nodes shouldn't drop off the tailnet on a key-expiry timer; key rotation should be a deliberate action.
#### Client tailnet ACL — replace the default allow-all
The default Tailscale policy allows everything between everyone. Replace it with this grants-based policy. It leaves all non-server devices unrestricted (so the client's existing fleet is untouched), keeps tag:client-server open (their existing Dell server stays as-is), and restricts tag:client-ai (this node) to only the three service ports — 8001 (LiteLLM), 8080 (Open WebUI), 5678 (n8n).
{
"grants": [
{
"src": ["*"],
"dst": ["autogroup:member"],
"ip": ["*"]
},
{
"src": ["*"],
"dst": ["tag:client-server"],
"ip": ["*"]
},
{
"src": ["*"],
"dst": ["tag:client-ai"],
"ip": ["tcp:8001", "tcp:8080", "tcp:5678"]
}
],
"tagOwners": {
"tag:client-ai": ["autogroup:admin"],
"tag:client-server":["autogroup:admin"]
}
}
From a device on the client's tailnet, nc -zv spark-02 8001, nc -zv spark-02 8080, and nc -zv spark-02 5678 all succeed. nc -zv spark-02 22 (SSH) and nc -zv spark-02 8000 (raw vLLM) both fail, proving the ACL is in effect.
tag:client-ai is intentionally not given tcp:8000. Port 8000 is the raw, unauthenticated vLLM head on spark-01 reached over the DAC; it must never be exposed onto the client's tailnet. The client only ever calls their own LiteLLM on spark-02:8001.
Lock down host firewalls
Tailscale ACLs are policy; the host firewall is enforcement. Apply ufw rules on both nodes so that even if Tailscale is misconfigured, ports 8000 and 8001 cannot leak to the wrong network.
#### spark-01 — host firewall
# Allow from your tailnet (interface tailscale0) and DAC peer only
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow in on tailscale0 to any port 8001 proto tcp # your LiteLLM
sudo ufw allow in on tailscale0 to any port 8080 proto tcp # your Open WebUI
sudo ufw allow in on tailscale0 to any port 5678 proto tcp # your n8n
sudo ufw allow in on tailscale0 to any port 9119 proto tcp # your Hermes dashboard
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 8000 proto tcp # client LiteLLM → vLLM only
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 6379 proto tcp # Ray GCS over DAC
sudo ufw allow ssh # mgmt LAN ssh
sudo ufw enable
#### spark-02 — host firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow in on tailscale0 to any port 8001 proto tcp # client LiteLLM
sudo ufw allow in on tailscale0 to any port 8080 proto tcp # client Open WebUI
sudo ufw allow in on tailscale0 to any port 5678 proto tcp # client n8n
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE1_DAC_IP # NCCL/Ray over DAC
sudo ufw allow ssh
sudo ufw enable
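After applying the rules, it can help to check mechanically that a sensitive port is only allowed on the interface you intended. The sketch below reads ufw status output on stdin and prints any rule for the given port that is not bound to the expected interface. The helper name audit_port is hypothetical, and the "PORT/tcp on IFACE" column layout is assumed from typical ufw output; check it against what your ufw version prints.

```shell
#!/usr/bin/env bash
# Sketch: flag ufw rules for a port that are not tied to one interface.
# audit_port is an illustrative helper; the "PORT/tcp on IFACE ..." layout
# is an assumption about ufw's status output.

audit_port() {   # audit_port <port> <interface>   (ufw status on stdin)
  awk -v port="$1" -v iface="$2" '
    { orig = $0; gsub(/ \(v6\)/, "") }      # drop the IPv6 marker; fields re-split
    $1 ~ ("^" port "(/tcp)?$") {            # first column is the port spec
      if (!($2 == "on" && $3 == iface)) { print orig; bad = 1 }
    }
    END { exit (bad ? 1 : 0) }
  '
}

if [ "${RUN_AUDIT:-0}" = "1" ]; then
  if sudo ufw status | audit_port 8000 enp1s0f0np0; then
    echo "port 8000: allowed on the DAC interface only"
  else
    echo "port 8000: unexpected rule(s) listed above"
  fi
fi
```

On spark-01 the same helper can be pointed at 8001 with tailscale0 as the expected interface. An empty rule set also passes, so confirm separately that ufw is enabled.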
nc -zv spark-01-tailscale-ip 8000 should fail with "connection refused" or "filtered". From the client's tailnet, curl http://spark-02-tailscale-ip:8001/v1/models -H "Authorization: Bearer CLIENT_KEY" should succeed.
Your Hermes Agent on spark-01
Hermes is your autonomous agent layer (skills, memory, cron, gateways). It runs on your node and talks to your LiteLLM. The client does not get a Hermes — they have their own Open WebUI and n8n on spark-02; if they want an agent they install their own.
The NVIDIA guide configures Hermes against Ollama at localhost:11434. This stack does not run Ollama; Hermes points at LiteLLM at localhost:8001 as a custom OpenAI-compatible endpoint. Ignore any Ollama references in the NVIDIA guide; the wizard answers in this step are the correct ones for this stack.
#### spark-01 — install
The installer asks for sudo to install ripgrep (fast file search, used by agent tools) and ffmpeg (required for TTS/voice message features) via apt. Both are safe to accept. If you are running unattended (e.g. piped from curl without a TTY), the installer skips the sudo prompt and logs a warning; install them manually first to avoid the fallback:
sudo apt install -y ripgrep ffmpeg
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
hermes --version
- Python 3.11: managed by uv (Astral). Virtual env at ~/.hermes/hermes-agent/venv/. uv itself installs to ~/.local/bin/uv and ~/.cargo/bin/uv.
- Node.js 22: downloaded to ~/.hermes/node/ with symlinks in ~/.local/bin/. Powers browser automation tools (Playwright-based).
- hermes CLI: symlinked to ~/.local/bin/hermes. The PATH line added to ~/.bashrc by the installer covers this; source it or open a new shell before running hermes setup.
Submodules initialized automatically:
- mini-swe-agent: terminal/shell tool backend. Gives the agent the ability to run arbitrary shell commands on spark-01. Review what workspace access you grant in hermes setup; prefer limiting to ~/workspace.
- tinker-atropos: RL-based skill self-improvement backend. Lets Hermes refine its own skill files after completing complex tasks.
#### spark-01 — setup wizard
hermes setup
| Prompt | Answer |
|---|---|
| Setup type | Full setup |
| Provider | Custom endpoint |
| API base URL | http://localhost:8001/v1 |
| API key | A virtual key scoped to Hermes from the LiteLLM Admin UI (recommended — gives a per-service audit trail). The master key also works but grants unrestricted access. |
| Model | Qwen3.5-122B-Non-Reasoning (the LiteLLM model alias — Hermes calls LiteLLM, which maps this to the vLLM backend) |
| Terminal backend | Local — set workspace to ~/workspace (limits shell tool scope to a safe directory) |
| Session reset mode | Inactivity + daily reset |
| Search provider | Skip (or Brave Search if configured — see Brave Search MCP section) |
| Launch chat now? | n |
The wizard writes ~/.hermes/config.yaml. base_url is local — Hermes and your LiteLLM both live on spark-01.
Edit ~/.hermes/SOUL.md to define Hermes' personality. This file is re-read on every message, so changes take effect immediately without restarting the gateway. Leave it empty to use the default personality.
Example for a terse technical assistant:
You are a concise, direct technical assistant. No filler. No preamble. State the answer first, then explain if needed. Use correct terminology.
#### Hermes file locations
| Path | Contents |
|---|---|
| ~/.hermes/config.yaml | Main config — model, endpoint, providers |
| ~/.hermes/.env | API keys and gateway tokens |
| ~/.hermes/SOUL.md | Agent persona/tone — hot-reloaded each message, no restart needed |
| ~/.hermes/skills/ | Persistent skill documents (agentskills.io format) — bundled skills seeded here automatically on install |
| ~/.hermes/memories/ | Long-term memory store |
| ~/.hermes/sessions/ | Conversation session state |
| ~/.hermes/logs/ | Gateway and agent logs |
| ~/.hermes/cron/ | Scheduled task definitions |
| ~/.hermes/hermes-agent/ | Cloned repo + venv (managed — do not edit directly) |
| ~/.hermes/node/ | Node.js 22 runtime (managed — do not edit directly) |
Telegram gateway
Create a bot via @BotFather (/newbot, copy the token) and get your user ID from @userinfobot. Then:
hermes setup gateway # select Telegram, paste token, paste user ID, choose System service
sudo /home/YOUR_USERNAME/.local/bin/hermes gateway install --system
sudo systemctl start hermes-gateway
sudo systemctl status hermes-gateway --no-pager
The full path to hermes is needed because sudo does not inherit your user's $PATH. Alternatively, hermes gateway install (without sudo) installs a user-scope service instead of a system service. Both work, but the system service starts on boot without a logged-in user session.
Built-in dashboard
Hermes ships a built-in web dashboard for managing config, API keys, and sessions. Run it as a systemd service so it starts automatically alongside the gateway.
#### spark-01 — create the dashboard service
sudo tee /etc/systemd/system/hermes-dashboard.service << 'EOF'
[Unit]
Description=Hermes Agent Dashboard
After=network.target hermes-gateway.service
Wants=hermes-gateway.service
[Service]
Type=simple
User=YOUR_USERNAME
ExecStart=/home/YOUR_USERNAME/.local/bin/hermes dashboard --port 9119 --host 0.0.0.0 --insecure --no-open
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable hermes-dashboard
sudo systemctl start hermes-dashboard
sudo systemctl status hermes-dashboard --no-pager
The --insecure flag is required to bind to 0.0.0.0 instead of localhost. Protect port 9119 with Tailscale ACLs (Step 06) — it should only be reachable from your tailnet, not the open internet.
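The dashboard may need a few seconds before it answers after systemctl start. A small polling loop avoids guessing; this is a sketch only, where retry is a hypothetical helper (not part of Hermes) and the attempt count and delay are arbitrary. The /api/status path is the one this guide uses elsewhere to check the dashboard.

```shell
#!/usr/bin/env bash
# Sketch: poll a command until it succeeds or the attempts run out.
# retry is an illustrative helper name (assumption), not a Hermes command.

retry() {   # retry <attempts> <delay-seconds> <command...>
  local attempts="$1" delay="$2"; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

if [ "${RUN_POLL:-0}" = "1" ]; then
  if retry 10 3 curl -fsS -o /dev/null http://127.0.0.1:9119/api/status; then
    echo "dashboard is answering on 9119"
  else
    echo "dashboard not answering; try: sudo journalctl -u hermes-dashboard -n 50"
  fi
fi
```

If the loop gives up, the journal usually shows whether the service is restarting repeatedly or simply failed to bind the port.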
Open http://YOUR_NODE1_TAILSCALE_IP:9119 from any device on your tailnet.
hermes-dashboard.service will crash-loop silently (restart counter >3600) on a fresh install because the dashboard frontend has never been compiled. The system Node.js installed in the steps above is for system-wide use (e.g. npx MCP servers); the dashboard build needs Node.js from within the Hermes install path.
Run this once after every hermes update:
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
cd ~/.hermes/hermes-agent/web && npm install && npm run build
sudo systemctl restart hermes-dashboard
sudo systemctl status hermes-dashboard --no-pager
curl -s http://127.0.0.1:9119/api/status
Dashboard — basic auth (required for remote access)
Binding the dashboard to 0.0.0.0 exposes it to the local network. Add basic auth so it requires a username and password even inside the tailnet. Set these in ~/.hermes/.env on spark-01 (not in config.yaml):
HERMES_DASHBOARD_BASIC_AUTH_USERNAME=YOUR_DASHBOARD_USERNAME
HERMES_DASHBOARD_BASIC_AUTH_PASSWORD=YOUR_DASHBOARD_PASSWORD
Restart the dashboard after setting these. Verify auth is active:
sudo systemctl restart hermes-dashboard
curl -s http://127.0.0.1:9119/api/status | grep auth
"auth_required": true, "auth_providers": ["basic"]
Dashboard — auto-restart on gateway update
After hermes update (or any hermes-gateway restart), the dashboard process goes stale and must be restarted. Wire this up automatically via a systemd override and a sudoers rule — the gateway service runs as your user, which can't call systemctl on system services without sudo.
#### spark-01 — sudoers rule
sudo tee /etc/sudoers.d/hermes-dashboard-restart <<'EOF'
YOUR_USERNAME ALL=(root) NOPASSWD: /usr/bin/systemctl --no-block try-restart hermes-dashboard.service
EOF
#### spark-01 — gateway service override
sudo mkdir -p /etc/systemd/system/hermes-gateway.service.d
sudo tee /etc/systemd/system/hermes-gateway.service.d/override.conf <<'EOF'
[Service]
ExecStartPost=-/usr/bin/sudo -n /usr/bin/systemctl --no-block try-restart hermes-dashboard.service
EOF
sudo systemctl daemon-reload && sudo systemctl restart hermes-gateway
The - prefix on ExecStartPost is required: it prevents a sudo failure from propagating and crashing the gateway. --no-block fires the dashboard restart asynchronously so it doesn't delay the gateway startup.
Hermes desktop app (v0.15.2+)
The Hermes desktop app (released June 2026) connects to your dashboard on spark-01 — it does not run its own backend. Without the remote URL set before first launch, the app spins up a local backend, SIGTERMs it, and enters a reset loop.
Set HERMES_DESKTOP_REMOTE_URL before launching the app for the first time. Add to ~/.hermes/.env on your Mac (not on spark-01):
HERMES_DESKTOP_REMOTE_URL=http://YOUR_NODE1_TAILSCALE_IP:9119
HERMES_DESKTOP_REMOTE_TOKEN=YOUR_DASHBOARD_SESSION_TOKEN
HERMES_DESKTOP_REMOTE_TOKEN must match HERMES_DASHBOARD_SESSION_TOKEN in ~/.hermes/.env on spark-01. The dashboard basic auth credentials (username/password) are separate from this token.
Hermes memory limits
Default memory limits can cause the agent to truncate context or behave inconsistently on long sessions. These values are confirmed stable on this stack:
hermes config set memory_char_limit 6000
hermes config set user_char_limit 3000
sudo systemctl restart hermes-gateway
Use hermes config set; do not hand-edit ~/.hermes/config.yaml for these values. The CLI validates and encodes the values, which a hand edit that still parses as YAML can get wrong.
Keeping Hermes up to date
# spark-01: Update Hermes to the latest release
hermes update
# Pulls latest from main, applies dependency changes, restarts the gateway service.
# Confirm it came back up:
sudo systemctl status hermes-gateway --no-pager
Your n8n on spark-01
This is your single-instance n8n on spark-01. The client runs their own n8n on spark-02 independently, with different workflows, different credentials, and separate database state (Postgres or SQLite). Neither side can read the other's flows.
#### spark-01 — docker-compose for n8n
cat > ~/spark-ai-stack/n8n.yml << 'EOF'
services:
n8n:
image: n8nio/n8n:latest
container_name: n8n
restart: unless-stopped
ports:
- "5678:5678"
environment:
- N8N_HOST=0.0.0.0
- N8N_PORT=5678
- N8N_PROTOCOL=http
- WEBHOOK_URL=http://YOUR_TAILNET_HOSTNAME:5678/
- N8N_SECURE_COOKIE=false
- NODE_ENV=production
- GENERIC_TIMEZONE=America/Los_Angeles
volumes:
- n8n_data:/home/node/.n8n
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
n8n_data:
EOF
docker compose -f ~/spark-ai-stack/n8n.yml up -d
Open http://localhost:5678 on spark-01 (or your tailnet hostname) and create the owner account.
Wire your n8n to your LiteLLM
In n8n, add an OpenAI credential pointing at your LiteLLM (local on spark-01):
| Field | Value |
|---|---|
| API URL | http://host.docker.internal:8001/v1 |
| API Key | your master_key |
| Default model | qwen3.5-122b |
The client's n8n on spark-02 follows the same pattern but points at their LiteLLM (http://host.docker.internal:8001/v1 from inside their n8n container) with the client's master key. Step 03 covers the client's LiteLLM; the client deploys their n8n the same way you deploy yours.
Stack Validation Checklist
Both sides have to pass independently, and each side has to fail in the right places (you should not be able to reach the client's stack, and vice versa). Every item is labeled with the node it should be run from.
Shared compute pool
- spark-01: docker exec vllm_node ray status shows 2 nodes and 2.0/2.0 GPU, with both DAC IPs (YOUR_NODE1_DAC_IP and YOUR_NODE2_DAC_IP) listed.
- spark-01: docker logs vllm_node | grep "NCCL INFO NET/IB" shows both rocep1s0f0 and roceP2p1s0f0. NCCL is on RoCE/RDMA, not TCP sockets.
- spark-01: nvidia-smi Processes section shows a vllm/RayWorkerWrapper at ~61 GB (122B FP8). The --query-gpu=memory.used/memory.total fields return [N/A] on GB10; this is expected, so use the Processes section instead.
- spark-02: nvidia-smi Processes section shows a vllm/RayWorkerWrapper at ~61 GB.
- spark-01: curl http://localhost:8000/v1/models returns qwen3.5-122b (vLLM head). A streamed completion runs at ~45 tok/s steady state.
Your side (spark-01)
- spark-01: curl http://localhost:8001/v1/models -H "Authorization: Bearer YOUR_MASTER_KEY" returns qwen3.5-122b through your LiteLLM.
- spark-01: sudo systemctl status litellm hermes-gateway shows both active (running).
- spark-01: docker ps shows vllm_node, open-webui, and n8n all Up. sudo systemctl status hermes-dashboard is active (running).
- spark-01: Open http://localhost:8080 (or your tailnet hostname) and send a chat message; the completion streams back. Your LiteLLM log records the request.
- spark-01: Send a Telegram message; Hermes responds. End-to-end your-side path: Telegram → Hermes → your LiteLLM → vLLM TP=2. Confirm in LiteLLM logs that the request arrived on the Hermes virtual key, not the master key.
- spark-01: hermes --version returns a version string. If MCP servers are configured: hermes mcp test <server-name> verifies MCP connectivity.
- spark-01: sudo systemctl status hermes-gateway hermes-dashboard shows both active (running).
- spark-01: ~/.hermes/SOUL.md exists (created by installer). ~/.hermes/skills/ is populated with bundled skills seeded by the installer.
Client side (spark-02)
- spark-02: curl http://localhost:8001/v1/models -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" returns qwen3.5-122b through the client's LiteLLM.
- spark-02: curl http://localhost:8001/v1/chat/completions -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" ... returns a completion.
- spark-02: sudo systemctl status litellm shows active (running). Logs are written to ~/spark-ai-stack/logs/litellm.log on spark-02, separate from your logs.
- spark-02: docker ps shows vllm_node (the worker), open-webui, and n8n all Up. (No Hermes container.)
- spark-02: Open http://localhost:8080 (or the client's tailnet hostname) and send a chat message; the completion streams back. Client's LiteLLM log records the request; your LiteLLM log on spark-01 does not.
Cross-stack isolation (the negative tests)
- From client's tailnet: curl http://YOUR_NODE1_TAILSCALE_IP:8001/v1/models should fail (timeout or "no route to host"). Your LiteLLM is not exposed to the client's tailnet.
- From client's tailnet: curl http://YOUR_NODE1_TAILSCALE_IP:8000/v1/models should fail. The unauthenticated vLLM endpoint is not reachable from the client's tailnet.
- From client's tailnet: curl http://YOUR_NODE1_TAILSCALE_IP:8080 should fail. Your Open WebUI is not reachable from the client's tailnet.
- From your tailnet: curl http://YOUR_NODE2_TAILSCALE_IP:8080 should fail. The client's Open WebUI is not reachable from your tailnet.
- spark-01: tail -f ~/spark-ai-stack/logs/litellm.log while the client sends a chat message → your log is silent. Client's traffic does not enter your stack.
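The negative tests above are easy to get wrong by hand, because a failure is the expected result. One way to make the expectation explicit is a small table-driven check. This is a sketch under assumptions: probe, verdict, and the RUN_CHECKS switch are hypothetical names, the target list is only an example for a device on the client's tailnet, and a TCP connect test with nc stands in for the curl calls.

```shell
#!/usr/bin/env bash
# Sketch: compare expected reachability ("open" / "closed") with a TCP probe.
# probe, verdict, and the target list are illustrative (assumptions).

probe() {     # probe <host> <port>  -> prints "open" or "closed"
  if nc -z -w 3 "$1" "$2" >/dev/null 2>&1; then echo open; else echo closed; fi
}

verdict() {   # verdict <expected> <actual>  -> prints PASS or FAIL
  if [ "$1" = "$2" ]; then echo PASS; else echo FAIL; fi
}

if [ "${RUN_CHECKS:-0}" = "1" ]; then
  # Run from a device on the client's tailnet; substitute the real addresses.
  while read -r host port expected; do
    printf '%-26s %-5s expect %-6s %s\n' "$host" "$port" "$expected" \
      "$(verdict "$expected" "$(probe "$host" "$port")")"
  done <<'EOF'
YOUR_NODE1_TAILSCALE_IP 8000 closed
YOUR_NODE1_TAILSCALE_IP 8001 closed
YOUR_NODE1_TAILSCALE_IP 8080 closed
YOUR_NODE2_TAILSCALE_IP 8001 open
EOF
fi
```

Any FAIL line on a "closed" row means a port is reachable that should not be; fix the ACL and ufw rules before going further.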
DAC traffic during inference
- spark-02: nload enp1s0f0np0 spikes into Gb/s during decoding from either side. Both your and the client's inference requests traverse the DAC (yours for NCCL collectives, theirs for both the API call to 198.51.100.1:8000 and NCCL).
- Both sides simultaneously: have you and the client send a long prompt at the same time. Both completions should stream concurrently; vLLM's continuous batching handles the overlap.
Config validation
- spark-01 (your Hermes config): YAML validation passes. A parse error will silently break Hermes by falling back to .env:
python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
Expected output:
YAML valid
Port reference
spark-01 (your node)
spark-02 (client node)
File locations
spark-01 — your node
~/spark-vllm-docker/
├── .env # LOCAL_IP, ETH_IF, IB_IF, CONTAINER_HF_TOKEN — written by --discover
├── build-and-copy.sh # builds vllm-node-tf5:latest + copies to spark-02
├── run-recipe.sh # launches Ray + vllm serve from the chosen recipe
└── launch-cluster.sh # stop/teardown helper invoked by systemd
~/spark-ai-stack/
├── litellm-config.yaml # YOUR LiteLLM — your master_key, localhost:8000
├── n8n.yml # compose for your n8n
└── logs/
├── litellm.db # YOUR SQLite request log
└── litellm.log # YOUR text log
~/.hermes/ # YOUR Hermes config, memory, skills
~/workspace/ # YOUR Hermes file workspace
Docker volumes:
├── open-webui # YOUR Open WebUI database, knowledge bases, RAG
└── n8n_data # YOUR n8n flows + credentials
/etc/systemd/system/
├── vllm-cluster.service # vLLM cluster auto-start (head SSHes to worker)
├── litellm.service # YOUR LiteLLM auto-start
├── litellm.service.d/override.conf # PYTHONPATH only
├── hermes-gateway.service # YOUR Telegram gateway auto-start
└── hermes-dashboard.service # YOUR Hermes dashboard auto-start
spark-02 — client node (separate ownership)
└── models--Qwen--Qwen3.5-122B-A10B-FP8/
~/spark-ai-stack/
├── litellm-config.yaml # CLIENT LiteLLM — client's master_key, points DAC → vLLM
├── n8n.yml # compose for client's n8n
└── logs/
├── litellm.db # CLIENT SQLite log — separate corpus
└── litellm.log # CLIENT text log
Docker image (pushed from spark-01):
└── vllm-node-tf5:latest # identical image as spark-01; worker container started by spark-01's launcher over SSH
Docker volumes:
├── open-webui # CLIENT Open WebUI database, knowledge bases, RAG
└── n8n_data # CLIENT n8n flows + credentials
/etc/systemd/system/
└── litellm.service # CLIENT LiteLLM auto-start
Backup targets
| Node / Owner | Path | Contents |
|---|---|---|
| spark-01 — you | ~/spark-ai-stack/logs/ | Your LiteLLM corpus |
| spark-01 — you | ~/.hermes/ | Your Hermes memory, sessions, skills |
| spark-01 — you | Docker volumes open-webui, n8n_data | Your UI state — chat history, knowledge bases, flows |
| spark-02 — client | ~/spark-ai-stack/logs/ | Client's LiteLLM corpus (their backup, not yours) |
| spark-02 — client | Docker volumes open-webui, n8n_data | Client's UI state (their backup, not yours) |
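The targets in the table split into two kinds: plain directories and Docker named volumes. A minimal sketch of archiving both follows. The function names, BACKUP_ROOT, the dated file naming, and the use of a throwaway alpine container to read the volume are all assumptions, not part of this stack.

```shell
#!/usr/bin/env bash
# Sketch: archive a directory or a Docker volume into a dated tarball.
# backup_dir, backup_volume, and BACKUP_ROOT are illustrative names (assumptions).

backup_dir() {      # backup_dir <source-dir> <dest-dir>  -> prints archive path
  local src="$1" dest="$2" name
  name="$(basename "$src")-$(date +%Y%m%d).tar.gz"
  name="${name#.}"                    # avoid a hidden archive for ~/.hermes
  mkdir -p "$dest"
  tar -czf "$dest/$name" -C "$(dirname "$src")" "$(basename "$src")"
  echo "$dest/$name"
}

backup_volume() {   # backup_volume <volume-name> <absolute-dest-dir>
  local vol="$1" dest="$2"
  mkdir -p "$dest"
  # Mount the volume read-only in a throwaway container and tar its contents.
  docker run --rm -v "$vol":/data:ro -v "$dest":/backup alpine \
    tar -czf "/backup/$vol-$(date +%Y%m%d).tar.gz" -C /data .
}

if [ "${RUN_BACKUP:-0}" = "1" ]; then
  BACKUP_ROOT="${BACKUP_ROOT:-$HOME/backups}"
  backup_dir "$HOME/spark-ai-stack/logs" "$BACKUP_ROOT"
  backup_dir "$HOME/.hermes" "$BACKUP_ROOT"
  backup_volume open-webui "$BACKUP_ROOT"
  backup_volume n8n_data "$BACKUP_ROOT"
fi
```

Each owner would run this only on their own node. Archiving a live SQLite file can capture it mid-write, so stopping the service briefly (or accepting that risk) is a choice to make deliberately.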
Cluster issues
Two-node-specific failures and their resolutions. Most of these were discovered during live deployment.
nload on the DAC shows traffic, but only a fraction of link capacity.
/dev/infiniband wasn't passed into the container, NCCL_IB_HCA wasn't set, or both. The hand-built vllm-spark:26.04 image had this problem by design: RDMA passthrough was never plumbed in.
NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 and passes /dev/infiniband into both containers automatically. If you're on the old hand-built image, migrate; there is no equivalent flag fix. Confirm the transport:
docker logs vllm_node | grep -E "NET/IB|NET/Socket"
# NET/IB = RDMA (good)
# NET/Socket = TCP fallback (bad)
The NVFP4 checkpoint (Sehyo/Qwen3.5-122B-A10B-NVFP4) loads weights to ~37 GB on both nodes, the worker registers with Ray, and then everything stops. No errors, no progress, no completion of engine init.
eugr/spark-vllm-docker (the recipe is marked solo_only: true) and vllm/vllm-openai:cu130-nightly. The Sehyo/Qwen3.5-122B-A10B-NVFP4 checkpoint additionally has weight-name mismatches with newer vLLM fused Mamba layer names.
Use Qwen/Qwen3.5-122B-A10B-FP8 for multi-node inference (Step 01 production model). Do not attempt NVFP4 with TP=2 multi-node until community support is confirmed.
spark-01 but the older image is still on spark-02) or one node is running an out-of-date image tag.
Ray runtime started in docker logs vllm_node, then start the worker. Both nodes must run identical images; re-run the copy step whenever you bump the image on either side:
cd ~/spark-vllm-docker
./build-and-copy.sh --tf5 --copy-to YOUR_NODE2_DAC_IP
node:192.0.2.x not found
VLLM_HOST_IP is unset, so vLLM resolves the node's hostname to the mgmt IP (192.0.2.x), but Ray registered each node on its DAC IP (198.51.100.x). The placement group spec then targets a node Ray has never seen.
Set VLLM_HOST_IP to each container's own DAC IP: on spark-01's head container, VLLM_HOST_IP=YOUR_NODE1_DAC_IP; on spark-02's worker container, VLLM_HOST_IP=YOUR_NODE2_DAC_IP. The eugr launcher's ./run-recipe.sh --discover (Step 1c) sets LOCAL_IP in .env per node and the cluster launcher exports it as VLLM_HOST_IP into each container.
run-recipe.sh launcher blocks vllm serve behind this check, so under normal operation you won't see it. Confirm once the worker joins:
docker exec vllm_node ray status
# Expected: 2.0/2.0 GPU
docker logs vllm_node on spark-02) print "ConnectionError: ... 198.51.100.1:6379" continuously and never advance.
LOCAL_IP on the head's .env isn't the DAC IP so Ray bound to the wrong interface, or (c) a firewall blocks port 6379 over the DAC.
Set LOCAL_IP=YOUR_NODE1_DAC_IP in ~/spark-vllm-docker/.env on the head. Check Ray, restart if needed, then probe port 6379 from the worker:
docker exec vllm_node ray status
# if Ray isn't running:
cd ~/spark-vllm-docker && ./run-recipe.sh qwen3.5-122b-fp8 -d -- --served-model-name qwen3.5-122b --gpu-memory-utilization 0.80
# then, from the worker (spark-02):
nc -zv YOUR_NODE1_DAC_IP 6379
# should print: Connection to YOUR_NODE1_DAC_IP 6379 port [tcp/*] succeeded
nload enp1s0f0np0 stays idle; mgmt LAN sees Gb/s spikes instead.
Either (a) NCCL_IB_HCA wasn't set so NCCL couldn't find the RoCE devices and fell back to TCP auto-detect, picking the first interface it saw (often mgmt), or (b) only one of the two RoCE twins is listed in NCCL_IB_HCA and NCCL silently went to TCP rather than use one twin.
Set IB_IF=rocep1s0f0,roceP2p1s0f0 (both twins, comma-separated) and ETH_IF=enp1s0f0np0 in ~/spark-vllm-docker/.env on both nodes. Verify NCCL picked up the RoCE devices:
docker logs vllm_node | grep "NCCL INFO NET/IB"
# Expected: both rocep1s0f0 and roceP2p1s0f0 listed
nvidia-smi memory.used returns [N/A] on GB10
nvidia-smi --query-gpu=memory.used,memory.total --format=csv returns [N/A] in both fields. Memory monitoring scripts that depend on these fields silently report nothing.
The memory.used / memory.total NVML fields are not populated on this hardware.
Run plain nvidia-smi and read the Processes section. With the 122B FP8 model loaded you should see a vllm / RayWorkerWrapper process at ~61 GB per node; 35B FP8 lands closer to ~97 GB.
nvidia-smi
# Read the Processes section: memory.used / memory.total fields return [N/A] on GB10
ssh spark-02 from spark-01 hangs or returns "connection refused", even though both nodes are reachable.
198.51.100.x). Unless you've explicitly bound sshd to the DAC interface, SSH only listens on the mgmt LAN, but your client has resolved the hostname to the DAC IP.
/etc/hosts on both nodes (covered in Prerequisites). The HF cache rsync uses the DAC IP explicitly, but every other ssh spark-0X command relies on hostname resolution.
echo "YOUR_NODE1_MGMT_IP spark-01" | sudo tee -a /etc/hosts
echo "YOUR_NODE2_MGMT_IP spark-02" | sudo tee -a /etc/hosts
docker: permission denied on spark-02
permission denied while trying to connect to the Docker daemon socket. Common on a freshly-imaged worker node where docker installed cleanly but the operator account isn't in the docker group yet.
docker group. The new group only takes effect after re-logging or running newgrp docker in the current shell.
sudo usermod -aG docker YOUR_USERNAME
newgrp docker
httpx.ConnectError on startup
httpx.ConnectError against either Postgres or a model endpoint.
STORE_MODEL_IN_DB=True + DATABASE_URL in /etc/systemd/system/litellm.service.d/override.conf still pointing at a Postgres that no longer runs, or (2) litellm-config.yaml includes a model with api_base pointing at a dead port (e.g. localhost:8002 from a previous dual-model setup).
PYTHONPATH (see Step 02). Audit litellm-config.yaml for any localhost:8002 or other dead endpoints and remove them; every model in model_list must have a live api_base.
vllm_node container on spark-01 exits immediately. docker logs vllm_node shows repeated [Errno -3] Temporary failure in name resolution against huggingface.co, followed by huggingface_hub.errors.LocalEntryNotFoundError and an OSError about being unable to connect to HF to load files.
huggingface.co to check for model config updates, even when the weights are already in the local cache. If Docker's internal DNS resolver hasn't recovered by the time the container starts (common immediately after a power cycle), the resolution fails and vLLM exits rather than falling back to the cached files.
Set HF_HUB_OFFLINE: 1 and TRANSFORMERS_OFFLINE: 1 in the recipe env block (see Step 1e). These vars were added to qwen3.5-122b-fp8.yaml as the permanent fix. If the container is already failing and the recipe hasn't been patched yet, apply the patch and restart the service:
sudo systemctl restart vllm-cluster.service
sudo journalctl -u vllm-cluster.service -f
vllm_node worker is running before restarting the service on spark-01. The launcher SSHes into spark-02 to start the worker, and if the worker container is down it must be started first or the cluster won't form.
connection refused or connect timeout on every request. ~/spark-ai-stack/logs/litellm.log on spark-02 shows httpx.ConnectError against 198.51.100.1:8000.
YOUR_NODE2_DAC_IP on port 8000 over enp1s0f0np0. Work through the chain:
ping -c 3 YOUR_NODE1_DAC_IP # DAC link reachable?
nc -zv YOUR_NODE1_DAC_IP 8000 # port open?
docker ps | grep vllm_node # container running?
sudo ufw status | grep 8000 # firewall rule present?
curl http://YOUR_NODE1_TAILSCALE_IP:8000/v1/models succeeds, meaning the client could bypass their LiteLLM and hit the unauthenticated vLLM directly.
tailscale0 for port 8000, or (c) ufw is disabled.
tag:owner:8000 entry anywhere. Only the explicit DAC-peer rule should appear for port 8000. If you find a leak, fix the ACL and ufw, then re-run the negative tests in the validation checklist.
sudo ufw status verbose | grep 8000
# Only this rule should appear for 8000:
# ALLOW IN enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 8000
curl http://YOUR_NODE2_MGMT_IP:8001/v1/models from spark-01 and get the client's LiteLLM, and it works, leaking the client's API surface to your mgmt LAN.
0.0.0.0:8001 so each is reachable from its own tailnet. Your firewall must restrict inbound on the mgmt-LAN interface to deny port 8001 cross-node; Tailscale ACLs alone don't help here because the mgmt LAN is not part of any tailnet.
tailscale0 only. Confirm no rule allows 8001 on the wildcard or mgmt interface:
sudo ufw status verbose | grep 8001
# Should only show: ALLOW IN tailscale0 to any port 8001
~/spark-ai-stack/logs/litellm.log on spark-01 that you didn't make. Or the client reports a request in their log that they didn't send.
master_key values are unique on each node; they should never match. Then check each Open WebUI's API base URL: yours should be http://host.docker.internal:8001/v1 (your local LiteLLM); client's should be the same hostname (their local LiteLLM, not your tailnet IP). If you find cross-pointing, fix it and rotate both master keys.
Qwen/Qwen3-235B-A22B-FP8 at TP=2 fails. Ray kills the worker with an OutOfMemoryError at the ~95% memory threshold during weight loading; the head logs report a placement-group failure shortly after.
huggingface-cli is deprecated
huggingface-cli download … prints a deprecation notice, or the command is missing on a fresh install of huggingface_hub.
hf download <model> --local-dir <path>. Step 01 has been updated to use hf; if you have an older snippet around, swap the binary name and the flags map cleanly.
preserve_thinking: true (or any other flag that exposes the visible CoT track). The Qwen3.5 family produces extended reasoning whenever thinking is enabled at template time, and conversational queries trigger excessive deliberation under that setting.
--default-chat-template-kwargs '{"enable_thinking": false}' as a vLLM flag after the -- separator on the run-recipe.sh launch command (the Step 1e launch line does this). Users can opt into extended reasoning per-message by prefixing a prompt with /think. This matches the behavior the recommended Open WebUI system prompt (Step 05) is calibrated for.
table public.LiteLLM_UserTable does not exist
LiteLLM_* table in the public schema.
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
prisma db push \
--schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma
sudo systemctl restart litellm
Not connected to DB
localhost:5432.
database_url is at the top level of litellm-config.yaml instead of nested under general_settings (LiteLLM only reads it from general_settings), or (b) the config still points at SQLite. SQLite is not supported for the UI; the Prisma schema is hardcoded for PostgreSQL.
database_url under general_settings in litellm-config.yaml, then restart and confirm:
general_settings:
  master_key: YOUR_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
sudo systemctl restart litellm
journalctl -u litellm -n 50 --no-pager
# Look for: Successfully connected to postgres DB
LiteLLM_VerificationTokenTable
VerificationToken not found error, but the key is still visible in the Virtual Keys tab.
prisma db push, which dropped/recreated the verification token table).
sudo systemctl restart litellm
Other known issues
litellm.APIConnectionError: OpenAIException - The min_p and logit_bias
sampling parameters are not yet supported with speculative decoding.
--speculative-config '{"method":"qwen3_next_mtp",...}' is active in the recipe. Open WebUI and Hermes send min_p and logit_bias sampling parameters by default. vLLM rejects any request containing these params when speculative decoding is enabled; there is no per-request fallback, so the entire API surface breaks simultaneously.
This is a known vLLM bug affecting Qwen3.5-class models: vllm-project/vllm#35800
--speculative-config from the recipe, then restart the cluster:
sed -i '/speculative-config/d' ~/spark-vllm-docker/recipes/qwen3.5-122b-fp8.yaml
grep "speculative" ~/spark-vllm-docker/recipes/qwen3.5-122b-fp8.yaml || echo "Removed OK"
docker stop vllm_node && docker rm vllm_node
sudo systemctl restart vllm-cluster.service
docker logs -f vllm_node on spark-01.
cu130-nightly / eugr spark-vllm-docker. The NVFP4 recipe is marked solo_only: true in eugr/spark-vllm-docker, and vllm/vllm-openai:cu130-nightly fails at cluster launch when NVFP4 is selected with TP=2 multi-node.
Qwen/Qwen3.5-122B-A10B-FP8 for multi-node inference; the Step 01 production recipe is built around this. See the NVFP4 stall entry in Cluster issues for the symptom on the bad path.
[Errno 98] address already in use
pkill -f litellm
sudo systemctl restart litellm
source ~/.bashrc
# or invoke directly:
/home/YOUR_USERNAME/.local/bin/hermes --version
sudo: hermes: command not found
sudo /home/YOUR_USERNAME/.local/bin/hermes gateway install --system
sudo systemctl start hermes-gateway
sudo systemctl status hermes-gateway --no-pager
npx not in LiteLLM's PATH
nvm or a user-local installer, which places binaries in ~/.local/bin and adds that path to the user's shell rc file (.bashrc). LiteLLM, running as a systemd service, never sources .bashrc, so it gets a clean environment with no ~/.local/bin in PATH and cannot find npx when it tries to spawn the stdio MCP server process.
sudo ln -sf /home/YOUR_USERNAME/.local/bin/node /usr/local/bin/node
sudo ln -sf /home/YOUR_USERNAME/.local/bin/npx /usr/local/bin/npx
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
vllm-cluster.service fails to start on spark-01. docker logs vllm_node shows address already in use :8000 or the Ray cluster comes up but the API server never binds.
restart: always on port 8000 (e.g. a hand-run vllm-qwen container predating vllm-cluster.service) grabbed the port on boot before vllm-cluster.service fired. Any container with --restart=always on port 8000 will cause this.
# Find what's holding port 8000
docker ps --filter "publish=8000" --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}"
# If a non-vllm_node container appears, stop and remove it:
docker stop <container-name> && docker rm <container-name>
# Then start the cluster:
sudo systemctl restart vllm-cluster.service
docker logs -f vllm_node
Correct restart policies for this stack:
vllm_node → restart: no (systemd-managed; must never self-restart)
n8n, open-webui, litellm-db → restart: unless-stopped
Post-outage startup order: (1) confirm spark-02 vllm_node is absent (docker ps | grep vllm_node), (2) spark-01: sudo systemctl restart vllm-cluster.service, (3) monitor: docker logs -f vllm_node.
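The restart-policy list above can be checked mechanically. The sketch below is illustrative only: `audit_restart_policies` is a hypothetical helper name, and it assumes `name policy` pairs arrive on stdin, one per line, in the shape `docker inspect` can print.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "name policy" pairs on stdin and prints one line
# for every container whose restart policy differs from the list above.
# Returns 1 if any mismatch was found, 0 otherwise.
audit_restart_policies() {
  local name policy expected rc=0
  while read -r name policy; do
    [ -z "$name" ] && continue
    name="${name#/}"                    # docker inspect prefixes names with "/"
    [ -z "$policy" ] && policy="no"     # an empty policy name means "no"
    case "$name" in
      vllm_node)                 expected="no" ;;
      n8n|open-webui|litellm-db) expected="unless-stopped" ;;
      *)                         continue ;;   # not covered by the list above
    esac
    if [ "$policy" != "$expected" ]; then
      echo "MISMATCH $name: policy=$policy expected=$expected"
      rc=1
    fi
  done
  return "$rc"
}

# Usage on a node (requires Docker; not run here):
#   docker ps -aq | xargs -r docker inspect \
#     --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' | audit_restart_policies
```

Containers outside the list are skipped, so this does not catch a stray container such as vllm-qwen; the docker ps port filter shown earlier remains the check for that case.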
~/.hermes/config.yaml, Hermes seems to ignore the changes: MCP is broken and the provider reverts to default. No obvious error in the logs.
.env values. No error is shown at startup, so the misconfiguration is invisible.
YAML valid, fix the syntax before restarting.
python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
mcp_servers or a failed MCP client handshake. Logs reference an asyncio error or a TCP/stdio connection that never opened.
mcp_servers block in ~/.hermes/config.yaml; it talks to LiteLLM's MCP endpoint over the LiteLLM API. A mcp_servers block in Hermes config is only correct in a direct-MCP (non-LiteLLM-proxied) topology, and on this stack it points Hermes at servers it can't reach.
mcp_servers block from ~/.hermes/config.yaml entirely. MCP tool calls continue to work because LiteLLM is in the path. Validate and restart:
python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
systemctl --user restart hermes
~/.hermes/config.yaml.
custom_providers must be indented exactly 2 spaces. A common mistake is using 4 spaces or no indentation.
custom_providers:
  - name: MyProvider
    base_url: http://localhost:8001/v1
    model: my-model
hermes command not found after install
hermes: command not found.
PATH="$HOME/.local/bin:$PATH" to ~/.bashrc. If you ran the installer via curl ... | bash in a non-login shell, the new PATH line is not active until you source it or open a new shell.
source ~/.bashrc # or open a new shell
hermes setup fails with import error: wrong Python picked up
hermes setup exits immediately with a Python import error or ModuleNotFoundError.
~/.hermes/hermes-agent/venv/ using Python 3.11 via uv. If the symlink at ~/.local/bin/hermes is broken or points elsewhere, the wrong Python is used.
ls -la ~/.local/bin/hermes
# Should point to: ~/.hermes/hermes-agent/venv/bin/hermes
sudo systemctl status hermes-gateway shows a failed or activating state after reboot.
hermes-gateway.service may start before the network is fully up.
# Add to the [Unit] section of /etc/systemd/system/hermes-gateway.service:
# After=network-online.target
# Wants=network-online.target
sudo systemctl daemon-reload && sudo systemctl restart hermes-gateway
cd ~/.hermes/hermes-agent
git submodule update --init --recursive
~/.hermes/hermes-agent/venv/bin/pip install -e ./mini-swe-agent
Clustered Open WebUI / n8n (HA notes)
Both Open WebUI and n8n have HA modes available, but for a two-node home/lab setup the operational complexity is not worth it. This stack runs them as single instances on spark-02. If you ever want to pursue HA, here are the pointers.
Open WebUI HA
- Switch the open-webui container's storage from a Docker volume to a Postgres backend (env: DATABASE_URL=postgresql://...) and a shared filesystem for uploads and RAG documents.
- Run multiple replicas behind a TCP load balancer. Sticky sessions are recommended for SSE chat streams.
- Postgres can sit on either node; if you put it on spark-01 you'll re-introduce the very latency-disturbance pattern this architecture is designed to avoid. Prefer a third small box or a dedicated HA pair.
n8n HA
- n8n's queue mode requires Postgres for state and Redis for the BullMQ queue. The main container becomes the main instance; one or more worker instances pull jobs off the queue.
- Set EXECUTIONS_MODE=queue, QUEUE_BULL_REDIS_HOST=…, DB_TYPE=postgresdb, and the relevant Postgres env vars on every container. Replicas need the same N8N_ENCRYPTION_KEY.
- For a two-node setup the simplest variant is one main on spark-02 and one worker on a third small box, with Postgres + Redis colocated on the third box.
- Webhook traffic should hit only the main container; long-running executions land on workers transparently.
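As a rough aid for the queue-mode settings listed above, the sketch below checks an env file for the four variables the list names. `check_queue_env` and the env-file layout (one KEY=value per line) are assumptions made for illustration; the Postgres connection variables are deliberately left out because the list does not name them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that an env file carries the queue-mode
# settings named in the list above. Prints each problem; returns 1 if any.
check_queue_env() {
  local file="$1" rc=0 key
  for key in EXECUTIONS_MODE QUEUE_BULL_REDIS_HOST DB_TYPE N8N_ENCRYPTION_KEY; do
    # The key must be present with a non-empty value
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing: $key"
      rc=1
    fi
  done
  # Two of the keys also need a specific value for queue mode
  grep -qx 'EXECUTIONS_MODE=queue' "$file" || { echo "EXECUTIONS_MODE must be queue"; rc=1; }
  grep -qx 'DB_TYPE=postgresdb' "$file"    || { echo "DB_TYPE must be postgresdb"; rc=1; }
  return "$rc"
}

# Usage: check_queue_env /path/to/n8n.env   (path is a placeholder)
```

Running it against the env file of the main and of each worker, then comparing the N8N_ENCRYPTION_KEY lines by hand, covers the "same key on every replica" requirement.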
Architecture
Obsidian vault sync runs on a dedicated Ubuntu 24.04 LXC (hostname: YOUR_LXC_HOSTNAME, IP: YOUR_LXC_IP) on the Proxmox homelab. Syncthing syncs the vault from Mac/Android to /vault on the LXC. @modelcontextprotocol/server-filesystem exposes the vault as a filesystem MCP server, wrapped by supergateway with streamableHttp transport on port 3000. LiteLLM on spark-01 connects to it at http://YOUR_LXC_IP:3000/mcp.
obsidian-mcp (StevenStavrakis) requires the Obsidian desktop app to be running in the LXC — not viable headless. @modelcontextprotocol/server-filesystem exposes the vault directory directly with no app dependency.
Obsidian (Mac/Android) <-> Syncthing <-> /vault on LXC <-> server-filesystem <-> supergateway :3000 <-> LiteLLM MCP client
LXC setup
| Setting | Value |
|---|---|
| OS | Ubuntu 24.04 LTS |
| Hostname | YOUR_LXC_HOSTNAME |
| IP | YOUR_LXC_IP |
| Vault path | /vault |
| Services | syncthing@root and obsidian-mcp — both enabled as systemd services |
LXC setup script
Run as root on a fresh Ubuntu 24.04 LXC.
#!/bin/bash
set -e
apt update
apt install -y curl gpg apt-transport-https
# Node.js 22
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
apt install -y nodejs
# Syncthing
curl -fsSL https://syncthing.net/release-key.gpg | \
gpg --dearmor -o /usr/share/keyrings/syncthing-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/syncthing-archive-keyring.gpg] \
https://apt.syncthing.net/ syncthing stable" \
> /etc/apt/sources.list.d/syncthing.list
apt update && apt install -y syncthing
# Vault directory
mkdir -p /vault
# Syncthing — After=network-online.target prevents a race where the service starts
# before the IP is assigned, causing the first vault sync to fail on boot
systemctl enable syncthing@root
systemctl start syncthing@root
sleep 8
# Expose Syncthing GUI on all interfaces
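CONFIG_PATH=$(find /root -name "config.xml" 2>/dev/null | grep syncthing | head -1)
# Abort with a clear message if Syncthing has not written its config yet
[ -n "$CONFIG_PATH" ] || { echo "Syncthing config.xml not found under /root" >&2; exit 1; }
sed -i 's|<address>127.0.0.1:8384</address>|<address>0.0.0.0:8384</address>|' "$CONFIG_PATH"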
systemctl restart syncthing@root
# @modelcontextprotocol/server-filesystem + supergateway
npm install -g @modelcontextprotocol/server-filesystem supergateway
# systemd service — streamableHttp transport, stateless (no --stateful), protocol 2024-11-05
cat > /etc/systemd/system/obsidian-mcp.service << 'EOF'
[Unit]
Description=Obsidian MCP Server (server-filesystem)
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=root
ExecStart=supergateway \
--stdio "npx -y @modelcontextprotocol/server-filesystem /vault" \
--port 3000 \
--outputTransport streamableHttp
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable obsidian-mcp
systemctl start obsidian-mcp
After the LXC is running
- Configure Syncthing on the LXC (port 8384) to accept a share from your Mac/Android
- Set the shared folder path to /vault
- Connect Mac Obsidian Syncthing client to the LXC Syncthing device
LiteLLM connection
Via litellm-config.yaml at the top level (not nested under litellm_settings). The protocol_version must be explicit — server-filesystem ignores 2025-11-25 and the handshake fails silently without it:
mcp_servers:
  - name: obsidian
    url: http://YOUR_LXC_IP:3000/mcp
    transport: streamableHttp
    protocol_version: "2024-11-05"
No mcp-session-id header is needed, because the gateway runs stateless. Clients must send Accept: application/json, text/event-stream; omitting it returns -32000 Not Acceptable. LiteLLM sends the correct header automatically when transport: streamableHttp is set.
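For a manual check of the header requirement described above, a minimal sketch follows. It builds a JSON-RPC initialize request pinned to the protocol version from the config and posts it with the required Accept header. The function names and the clientInfo values are made up for illustration, and the response shape will depend on the gateway.

```shell
#!/usr/bin/env bash
# Build a JSON-RPC "initialize" request pinned to the protocol version
# used in the LiteLLM config above (override by passing another version).
mcp_init_payload() {
  local version="${1:-2024-11-05}"
  printf '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"%s","capabilities":{},"clientInfo":{"name":"curl-probe","version":"0.0.1"}}}' "$version"
}

# POST the request with the Accept header the gateway requires.
# Not run here: it needs network access to the LXC.
mcp_probe() {
  local url="$1"
  curl -s -X POST "$url" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d "$(mcp_init_payload)"
}

# Usage: mcp_probe http://YOUR_LXC_IP:3000/mcp
```

Dropping the Accept line from `mcp_probe` is a quick way to reproduce the -32000 error on purpose.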
curl -s -H "Accept: application/json, text/event-stream" http://YOUR_LXC_IP:3000/mcp
Expected: JSON with method: "initialize" response. A -32000 error means the Accept header is missing.
Known issues
-32000 Not Acceptable on MCP POST requests
/mcp return {"error": {"code": -32000, "message": "Not Acceptable"}}.
Accept: application/json, text/event-stream header. LiteLLM sends it automatically; this error usually means a client (curl test, custom integration) is omitting it. Add -H "Accept: application/json, text/event-stream" to any manual curl calls.
protocol_version: "2024-11-05" in the LiteLLM MCP server config. @modelcontextprotocol/server-filesystem does not respond correctly to 2025-11-25: the handshake succeeds at the transport level but tool listing returns empty.
After=network.target races IP assignment on boot. The setup script uses After=network-online.target + Wants=network-online.target; verify these are present in the running unit: systemctl cat syncthing@root | grep -E "After|Wants". If they are missing, edit /etc/systemd/system/syncthing@root.service or create a drop-in override.
--outputTransport sse) breaks POST requests
Omitting --outputTransport streamableHttp defaults supergateway to SSE, which does not handle /mcp POSTs. Always specify --outputTransport streamableHttp explicitly.
Web Search Tool Use
Adds live web search as a callable tool — the AI model can run Brave Search queries during a conversation in response to tool calls from Open WebUI and other clients. LiteLLM spawns the server on demand via stdio using an API key you supply.
Step 1 — Get a Brave Search API key
Go to api.search.brave.com, create a free account, and generate an API key under the Data for AI plan (free tier supports up to 2,000 queries/month).
Step 2 — Confirm Node.js is installed system-wide
The MCP server is launched via npx. If you completed the Playwright MCP setup, Node.js is already installed system-wide and this step is done. Otherwise:
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
which npx (expected output: /usr/bin/npx)
Step 3 — Add Brave Search MCP server in LiteLLM UI
Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui → MCP Servers → Add New MCP Server. (LiteLLM lives on spark-01.)
| Field | Value |
|---|---|
| Name | brave-search |
| Alias | brave-search |
| Transport Type | Standard Input/Output (stdio) |
Set Stdio Configuration (JSON) — replace YOUR_BRAVE_API_KEY with your actual key:
{
"command": "npx",
"args": [
"-y",
"@modelcontextprotocol/server-brave-search"
],
"env": {
"BRAVE_API_KEY": "YOUR_BRAVE_API_KEY"
}
}
Save and confirm Health Status shows Healthy.
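If the health check fails, it can help to see which npx a given PATH would resolve to. The sketch below is a small PATH walker; `find_in_path` is a hypothetical helper, and the PATH string in the usage line is a typical default for system services rather than a value read from this stack.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first executable named $1 found in the
# colon-separated PATH string $2. Returns 1 if none is found.
find_in_path() {
  local name="$1" path="$2" dir
  local IFS=':'                 # split the PATH string on colons
  for dir in $path; do
    if [ -x "$dir/$name" ]; then
      echo "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Usage (the PATH below is an assumed service default, not read from LiteLLM):
#   find_in_path npx "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# To see the PATH the running service really has:
#   sudo cat /proc/"$(systemctl show -p MainPID --value litellm)"/environ | tr '\0' '\n' | grep '^PATH='
```

A user-local npx under ~/.local/bin will not appear in the result unless that directory is in the string passed in, which mirrors what the service sees.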
npx resolves to /usr/bin/npx (system-wide install) and not a user-local path. See the Known Issues section (MCP stdio servers fail health check) for the full diagnosis.
Validation
In Open WebUI, send the following prompt:
Expected: the model calls the brave_web_search tool (shown as "Explored" in Open WebUI) and returns a summary drawn from live search results.
Browser Automation
Adds browser automation tool use to the stack — the AI model can navigate pages, take screenshots, and scrape content via tool calls in Open WebUI and other clients. LiteLLM spawns a headless Chromium process on demand via stdio; no persistent port is required.
Step 1 — Install Node.js system-wide
LiteLLM runs as a systemd service and does not source .bashrc. Node.js must be installed system-wide so npx is available in LiteLLM's PATH.
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
which npx (expected output: /usr/bin/npx)
nvm or a user-local installer, this step replaces it with a system-wide install. The Known Issues section documents the npx PATH problem in detail.
Step 2 — Install Playwright MCP Chromium browser
Chrome has no ARM64 build. Use Chromium, installed via the @playwright/mcp package's own browser installer — not via npx playwright install:
npx @playwright/mcp install-browser chromium
ls ~/.cache/ms-playwright/ (expected: a chromium-XXXX directory is present)
Step 3 — Update litellm-config.yaml
Add model_info blocks to all model entries. Without these, LiteLLM does not advertise function calling support and tool calls will not execute.
model_list:
  - model_name: qwen3.5-122b
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
Restart LiteLLM after saving:
sudo systemctl restart litellm
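One way to confirm that tool definitions now pass through the proxy is to send a request that carries a tools array, in the standard OpenAI chat-completions shape. This is a sketch under assumptions: `get_time` is a made-up tool that nothing implements, `LITELLM_MASTER_KEY` is a shell variable you would set yourself, and the port is the 8001 used elsewhere in this guide.

```shell
#!/usr/bin/env bash
# Build a chat-completions request body that offers one hypothetical tool.
tool_call_payload() {
  local model="${1:-qwen3.5-122b}"
  cat <<EOF
{
  "model": "${model}",
  "messages": [{"role": "user", "content": "What time is it in UTC?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_time",
      "description": "Hypothetical tool used only for this smoke test",
      "parameters": {"type": "object", "properties": {}}
    }
  }],
  "tool_choice": "auto"
}
EOF
}

# Send the request to the local LiteLLM proxy.
# Not run here: it needs the proxy and a valid key.
tool_call_probe() {
  curl -s "http://localhost:8001/v1/chat/completions" \
    -H "Authorization: Bearer ${LITELLM_MASTER_KEY:?set LITELLM_MASTER_KEY}" \
    -H 'Content-Type: application/json' \
    -d "$(tool_call_payload "$@")"
}

# Usage: LITELLM_MASTER_KEY=sk-yourkey tool_call_probe
```

A reply whose message contains a tool_calls entry suggests the tools field reached the model; a plain text answer is not proof of failure, since the model may simply choose not to call the tool.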
Step 4 — Add Playwright MCP server in LiteLLM UI
Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui → MCP Servers → Add New MCP Server. (LiteLLM lives on spark-01.)
| Field | Value |
|---|---|
| Name | playwright |
| Alias | playwright |
| Transport Type | Standard Input/Output (stdio) |
Set Stdio Configuration (JSON):
{
"command": "npx",
"args": [
"-y",
"@playwright/mcp@latest",
"--browser",
"chromium",
"--headless"
]
}
Save and confirm Health Status shows Healthy.
Validation
In Open WebUI, send the following prompt:
Expected: the model calls the navigate and screenshot tools (shown as "Explored" in Open WebUI) and returns a summary of the page.
Home Assistant tool integration
Home Assistant exposes its own MCP endpoint at /api/mcp using Streamable HTTP transport. This lets Hermes (via LiteLLM) call HA as a tool — controlling devices, querying states, reading automations.
Step 1 — Enable the integration in Home Assistant
Settings → Devices & Services → Add Integration → search "Model Context Protocol Server"
| Field | Value |
|---|---|
| Integration | Model Context Protocol Server |
| Endpoint path | /api/mcp (built into HA — no extra server needed) |
| Transport | Streamable HTTP |
Step 2 — Generate a long-lived access token
In Home Assistant: Profile → Long-Lived Access Tokens → Create Token. Copy the full token — it is only shown once.
API_ACCESS_TOKEN environment variable when configuring the MCP server in LiteLLM, not inline as a bearer token header — inline tokens get truncated by the LiteLLM UI field length limit.
Step 3 — Add to LiteLLM MCP config
mcp_servers:
  - name: home-assistant
    url: http://YOUR_HA_IP:8123/api/mcp
    transport: streamableHttp
    headers:
      Authorization: "Bearer YOUR_HA_LONG_LIVED_TOKEN"
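Before wiring the token into LiteLLM, it can be checked directly against Home Assistant's REST API root, which answers authenticated requests at /api/. The sketch below only interprets the HTTP status code; the function names are hypothetical and the messages are illustrative.

```shell
#!/usr/bin/env bash
# Map an HTTP status code from Home Assistant to a short verdict.
interpret_ha_status() {
  case "$1" in
    200) echo "token accepted" ;;
    401) echo "token rejected" ;;
    000) echo "no connection" ;;      # curl reports 000 when no response arrived
    *)   echo "unexpected status $1" ;;
  esac
}

# Call the API root with the long-lived token and report the verdict.
# Not run here: it needs network access to Home Assistant.
ha_token_check() {
  local base="$1" token="$2" code
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${token}" "${base}/api/")
  interpret_ha_status "$code"
}

# Usage: ha_token_check http://YOUR_HA_IP:8123 "YOUR_HA_LONG_LIVED_TOKEN"
```

This checks the token only; it does not confirm that the Model Context Protocol Server integration from Step 1 is enabled.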
LiteLLM Admin UI
The LiteLLM proxy ships with a built-in web UI at /ui. It requires a master key and a PostgreSQL database — SQLite is not supported for the UI auth layer. The following documents every error encountered during setup, in order.
Step 1 — Set a master key
All commands below run on spark-01 (where LiteLLM lives). Add to ~/spark-ai-stack/litellm-config.yaml:
general_settings:
  master_key: sk-yourkey
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
Generate a secure key:
echo "sk-$(openssl rand -hex 16)"
Step 2 — Add PostgreSQL to a compose file on spark-01
Run Postgres on the same node as LiteLLM. Putting it on spark-02 would re-introduce the very latency-disturbance pattern this architecture is designed to avoid. Add to a new ~/spark-ai-stack/litellm-db.yml on spark-01:
services:
  litellm-db:
    image: postgres:16
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    ports:
      - "5432:5432"
    volumes:
      - litellm_db:/var/lib/postgresql/data
volumes:
  litellm_db:
docker compose -f ~/spark-ai-stack/litellm-db.yml up -d litellm-db
restart: unless-stopped combined with sudo systemctl enable docker ensures the container survives reboots automatically; no additional systemd unit is needed.
Step 3 — Install Prisma
LiteLLM uses Prisma as its database ORM. It is not included in the base pip install litellm package.
pip install prisma --break-system-packages
--break-system-packages bypasses the PEP 668 externally-managed-environment guard that Ubuntu 24.04 applies to its system Python 3.12, which otherwise stops pip from installing into that environment. It is safe on a dedicated AI server where system tools do not depend on conflicting packages.
Step 4 — Generate Prisma binaries
After installing the package, the binaries must be generated from LiteLLM's bundled schema:
cd ~/.local/lib/python3.12/site-packages/litellm/proxy
prisma generate --schema schema.prisma
Step 5 — Apply the database schema
The Postgres database exists but has no tables yet. Push the schema from the same directory as Step 4, because the --schema path is relative. DATABASE_URL must be passed inline: Prisma reads it directly from the environment, not from litellm-config.yaml.
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
prisma db push --schema schema.prisma
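If the push runs right after the container starts, Postgres may not be accepting connections yet. A small retry wrapper is one way to handle that. `retry` is a generic helper written for this sketch, and the usage lines assume the litellm-db container name and credentials from Step 2.

```shell
#!/usr/bin/env bash
# Generic retry: run a command until it succeeds or the attempts run out.
#   retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    # Sleep between attempts, but not after the last one
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage (requires Docker; not run here):
#   retry 10 2 docker exec litellm-db pg_isready -U litellm -d litellm
#   docker exec litellm-db psql -U litellm -d litellm -c '\dt'
```

The first usage line waits for the server to accept connections; the second lists the tables, so the LiteLLM_ tables should show up there once the push has succeeded.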
Step 6 — Restart LiteLLM
sudo systemctl daemon-reload
sudo systemctl restart litellm
sudo systemctl status litellm
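Beyond systemctl status, the proxy's readiness endpoint can show whether the database connection took. The sketch below assumes the response body is JSON with a top-level "db" field, which is how the LiteLLM readiness output is understood here; verify the field name against your installed version. `readiness_db_state` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Read a JSON body on stdin and print the value of its "db" string field.
# Prints nothing if the field is absent. Assumes a flat, single-line body.
readiness_db_state() {
  sed -n 's/.*"db"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (requires the proxy to be running; not run here):
#   curl -s http://localhost:8001/health/readiness | readiness_db_state
```

An empty result means the field was not found, which is a cue to inspect the raw response rather than a verdict on the database.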
Errors encountered in order
| Error | Cause | Fix |
|---|---|---|
| Authentication Error, Not connected to DB | No PostgreSQL configured | Add database_url to general_settings |
| ModuleNotFoundError: No module named 'prisma' | Prisma not installed | pip install prisma --break-system-packages |
| Unable to find Prisma binaries | prisma generate not run | Run prisma generate --schema schema.prisma |
| The table 'public.LiteLLM_UserTable' does not exist | Schema not applied to DB | Run prisma db push --schema schema.prisma |
Accessing the UI
Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui. Username: admin. Password: your master_key value.
Project milestones
| Date | Milestone |
|---|---|
| May 1, 2026 | DGX Spark acquisition — Project Jiffy initiated |
| May 9–13 | Two-node clustering, 200 GbE/RoCE networking, vLLM bring-up |
| May 13 | INT4 → FP8 model migration; NCCL/RoCE fix → 18× throughput improvement (2–3 tok/s → 45 tok/s) |
| May 15 | MTP speculative decoding removed — unstable in vLLM v0.19.0, HTTP 500 on standard sampling params |
| May 16 | Hermes Agent deployed — Telegram gateway live |
| May 24–29 | n8n trading bot and market research workflows built |
| June 2 | Hermes desktop app v0.15.2 released |
| June 8 | Desktop app integration; dashboard basic auth; Home Assistant MCP confirmed working |
| June 11 | Obsidian MCP — obsidian-mcp → @modelcontextprotocol/server-filesystem migration |
| June 12 | Hermes memory limits raised (memory_char_limit 6000, user_char_limit 3000); dashboard auto-restart via ExecStartPost + sudoers |
| June 15 | Power outage recovery — orphaned vllm-qwen container (restart:always on port 8000) identified and removed |