NVIDIA DGX
At-Home AI Stack — split-trust shared compute

Two clustered NVIDIA DGX Spark nodes (arm64, Ubuntu 24.04) share a 256 GB unified memory pool through tensor parallelism (TP=2) over Ray on a 200 Gb/s direct-attach copper (DAC) interconnect, but with split ownership. spark-01 is your node (your LiteLLM, your Open WebUI, your Hermes Agent, your n8n, your Tailscale). spark-02 is the client's node (their LiteLLM, their Open WebUI, their n8n, their Tailscale). Both LiteLLM proxies talk to the shared vLLM endpoint at spark-01:8000: yours over localhost, the client's over the DAC. Neither application stack sees the other. Read the Trust model section before deploying; this architecture has specific properties at the API layer that you should understand explicitly.

2× DGX Spark | vLLM TP=2 · Ray | Qwen3.5-122B-A10B-FP8 | LiteLLM × 2 | Open WebUI × 2 | n8n × 2 | Hermes Agent | Tailscale × 2 (separate tailnets) | DAC interconnect | arm64 native

Architecture

Two physically separate DGX Spark nodes share a single tensor-parallel vLLM cluster (TP=2 over Ray on a 200 Gb/s DAC link) — but each node runs its own independent application stack owned by a different party. The diagram shows three logical layers: application stacks (top, separate per owner), LiteLLM proxies (middle, one per side, separate keys and logs), and the shared compute pool (bottom, TP=2 across both nodes, served by the vLLM head on spark-01:8000). Tailscale sits as a separate overlay on each node — the DAC link is its own private hardware and does not traverse Tailscale.

Figure: split-trust architecture.

Application stacks (top), one per owner:
  • spark-01, your node (192.0.2.21 mgmt, 198.51.100.1 DAC), on your Tailscale tailnet with ACLs you control: Your Open WebUI (:8080, your data), Your n8n (:5678, your flows), Your Hermes (Telegram, skills, memory), and Your LiteLLM (:8001, your master_key, your log corpus, api_base = http://localhost:8000/v1). Your users reach it by browser, Telegram, or VS Code over your tailnet.
  • spark-02, client node (192.0.2.22 mgmt, 198.51.100.2 DAC), on a separate client tailnet with ACLs the client controls: Client Open WebUI (:8080, client's data), Client n8n (:5678, client's flows), and Client LiteLLM (:8001, client's master_key, client's log corpus, api_base = http://198.51.100.1:8000/v1 over the DAC). The client's users reach it by browser over their tailnet.

Shared compute pool (bottom), TP=2 over Ray, with a single vLLM endpoint at 198.51.100.1:8000 that both LiteLLM proxies call:
  • vLLM head (Ray master, TP rank 0) on spark-01:8000, Ray GCS on :6379, started with --tensor-parallel-size 2 and --distributed-executor-backend ray, VLLM_HOST_IP=198.51.100.1.
  • vLLM Ray worker (TP rank 1) on spark-02, no API listener, VLLM_HOST_IP=198.51.100.2.
  • Both ranks set NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0.
  • Model: Qwen/Qwen3.5-122B-A10B-FP8 in the 256 GB unified memory pool; bootstrap fallback Qwen/Qwen3.6-35B-A3B-FP8.
  • DAC: 200 Gb/s direct-attach copper, enp1s0f0np0, MTU 9216, 198.51.100.0/30. It carries NCCL allreduce over RoCE/RDMA and Ray control traffic, is not routed through either tailnet, and is separate from the mgmt LAN.

Tailscale overlays carry application traffic only.

Trust model

Read this before deploying. The split-trust architecture has specific properties at the API layer that you should understand explicitly. None of this is a new risk introduced by the cluster — it is just the same trust profile you accept any time you use a hosted inference API, made visible.

API-layer visibility — spark-01 sees all prompts

The vLLM head process runs on spark-01 and serves the OpenAI-compatible API on port 8000. Both LiteLLM proxies — yours and the client's — call this endpoint. That means the owner of spark-01 can, in principle, observe every raw prompt and every model output that crosses the API surface. This is structurally identical to the trust profile of any commercial hosted-inference provider (OpenAI, Anthropic, Together, etc.): the entity running the API server can see traffic at the API layer.

Tensor-layer isolation — spark-02 sees only floats

The Ray worker on spark-02 runs no API listener and never handles prompts or completions as text strings. It exchanges intermediate floating-point tensors over NCCL allreduce on the DAC link and contributes its share of the matrix multiplications. One caveat: in a typical tensor-parallel deployment each rank also receives the input token IDs so it can run its shard of the model. Token IDs are integers rather than strings, but they can be decoded with the model's public tokenizer, so treat this layer as a practical separation and not as a confidentiality guarantee.

Application-layer isolation — fully separate stacks

Knowledge bases, chat history, RAG pipelines, vector indexes, API keys, request logs, and OAuth tokens are completely separate on each node. Your Open WebUI's database is on spark-01; the client's is on spark-02. Your LiteLLM master key is yours; the client's is the client's. Neither party has access to the other's application stack — there is no cross-mounted volume, no shared Postgres, no shared file system. The only thing that crosses the boundary is the inference call from the client's LiteLLM into spark-01:8000.

Network isolation — separate tailnets, private DAC

Each node joins its owner's Tailscale tailnet independently. ACLs on each tailnet are controlled by that owner. The DAC link (198.51.100.0/30) is private physical hardware between the two nodes — it is not routed through either Tailscale network and is not advertised on either tailnet. Tailscale carries application traffic only (clients reaching their own UIs); compute traffic stays on the DAC.
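The separation described above can be checked mechanically. The following is a minimal sketch that classifies an IPv4 address as belonging to the DAC subnet, a tailnet, or neither. It assumes the DAC subnet used in this guide (198.51.100.0/30) and Tailscale's default 100.64.0.0/10 address range; the function names are illustrative and not part of the guide's scripts.

```shell
#!/usr/bin/env bash
# Sketch: classify an IPv4 address as DAC, tailnet, or other.
# Assumptions: DAC subnet 198.51.100.0/30 (from this guide) and
# Tailscale's default CGNAT range 100.64.0.0/10.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<<"$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed if address $1 lies inside CIDR block $2 (e.g. 198.51.100.0/30).
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( (ip & mask) == (net & mask) ))
}

# Print "dac", "tailnet", or "other" for address $1.
classify_addr() {
  if in_cidr "$1" 198.51.100.0/30; then
    echo dac
  elif in_cidr "$1" 100.64.0.0/10; then
    echo tailnet
  else
    echo other
  fi
}
```

For example, `classify_addr 198.51.100.1` prints `dac`. Running it against the host part of each LiteLLM `api_base` is one way to confirm that compute traffic is pointed at the DAC and not at a tailnet address.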

When this architecture is appropriate

  • Both parties have a working relationship and have agreed to this arrangement.
  • The data being sent for inference is not regulated (HIPAA / GDPR / SOC2 / PCI / etc.).
  • Both parties accept a trust model equivalent to using any commercial hosted-inference API.

When additional agreements are required

  • Either party handles regulated data — HIPAA, GDPR, SOC2, PCI-DSS, attorney-client privileged, or similar — in which case a written data processing agreement (DPA / BAA / equivalent) and audit controls are needed before traffic flows.
  • Either party has contractual data handling requirements imposed by their own customers or regulators.
  • The relationship is not pre-existing and the trust profile of "any hosted inference API" is not acceptable.
vLLM on spark-01:8000 has no authentication. Tailscale ACLs and host firewall rules are what prevent the client from bypassing their LiteLLM and hitting the unauthenticated endpoint directly. See Step 06 (Tailscale) for the ACL configuration that enforces this.
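As an illustration of the host-firewall half of that statement, here is a hedged sketch that prints (and does not run) one possible ufw rule for spark-01: it admits only the DAC peer to port 8000 on the DAC interface. The helper name is hypothetical, the defaults are the interface and addresses used in this guide, and the rule is only meaningful alongside a default-deny incoming policy. Step 06 remains the reference for the actual configuration.

```shell
#!/usr/bin/env bash
# Sketch: print a ufw rule that lets only the DAC peer reach vLLM.
# Assumptions: interface enp1s0f0np0, peer 198.51.100.2, port 8000,
# and a "default deny incoming" policy already in place on spark-01.
vllm_fw_rule() {
  local iface="${1:-enp1s0f0np0}" peer="${2:-198.51.100.2}" port="${3:-8000}"
  echo "ufw allow in on $iface from $peer to any port $port proto tcp"
}
```

Review the printed rule before applying it, for example with `sudo $(vllm_fw_rule)`. Whether ufw actually filters this traffic depends on how the vLLM container is networked, since ports published by Docker are handled by Docker's own firewall chains.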

Hardware topology

Node | Owner | Mgmt IP | DAC IP | Services
spark-01 | You (private) | 192.0.2.21 | 198.51.100.1 | vLLM head (Ray master, TP rank 0), Your LiteLLM, Your Open WebUI, Your Hermes Agent, Your n8n, Your Tailscale
spark-02 | Client (separate ownership) | 192.0.2.22 | 198.51.100.2 | vLLM Ray worker (TP rank 1), Client LiteLLM, Client Open WebUI, Client n8n, Client Tailscale

Interconnects

  • DAC interconnect: enp1s0f0np0, MTU 9216, point-to-point 198.51.100.0/30. Carries NCCL for tensor-parallel collectives, Ray control, and the client LiteLLM's inference calls into spark-01:8000. Not routed through either Tailscale network.
  • Mgmt interconnect: 192.0.2.0/24 over RJ45, default routes. Used for SSH and node-bootstrap traffic during setup.
  • Tailscale (each owner): each node independently joins its owner's tailnet. Application traffic (browser → Open WebUI, Telegram → Hermes, etc.) traverses Tailscale. The DAC link is never advertised onto either tailnet.
  • SSH: passwordless in both directions between spark-01 and spark-02 at the mgmt IPs (required for the rsync step in Step 01). After setup, this can be locked down or removed.
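The MTU listed for the DAC interface is easy to get wrong on one side. A small sketch, assuming the kernel exposes the value under /sys/class/net as it does on stock Ubuntu, compares the configured MTU with the expected one. The function name and the optional third argument (an alternate sysfs root, useful for dry runs) are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: compare an interface's MTU with the expected value.
# Usage: check_mtu IFACE EXPECTED [SYSFS_ROOT]
# Returns 0 on match, 1 on mismatch, 2 if the interface is not found.
check_mtu() {
  local iface="$1" want="$2" root="${3:-/sys/class/net}" got
  got=$(cat "$root/$iface/mtu" 2>/dev/null) || {
    echo "no such interface: $iface" >&2
    return 2
  }
  if [[ "$got" == "$want" ]]; then
    echo "ok: $iface mtu $got"
  else
    echo "mismatch: $iface mtu $got (want $want)"
    return 1
  fi
}
```

On either node, `check_mtu enp1s0f0np0 9216` would report whether the DAC interface matches the value in the list above.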

Architecture principles

  1. Shared compute, split application. The vLLM cluster is the only shared resource. Application stacks above it (LiteLLM, Open WebUI, Hermes, n8n, Tailscale) are duplicated and independently owned.
  2. Both LiteLLMs hit the same vLLM endpoint. Your LiteLLM uses http://localhost:8000/v1; client's LiteLLM uses http://198.51.100.1:8000/v1 over the DAC. Neither proxy goes through the other's stack.
  3. Separate keys, separate logs, separate data. Each LiteLLM has its own master key; each Open WebUI has its own knowledge bases and chat history. Nothing is shared at the application layer.
  4. Tailscale is per-owner. Two separate tailnets, two separate ACL policies. Cross-tailnet traffic only happens if both owners explicitly configure it (which by default they do not).
  5. Single-instance per side. No clustered Open WebUI / clustered n8n on either node. HA modes are documented in the appendix only.
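Principle 2 can be made concrete with a short sketch that returns the vLLM base URL each side would use. The role names `owner` and `client` and the function name are assumptions for illustration; the URLs are the ones given in this guide.

```shell
#!/usr/bin/env bash
# Sketch: which vLLM base URL does each LiteLLM proxy use?
# Usage: vllm_api_base owner|client [HEAD_DAC_IP]
vllm_api_base() {
  local role="$1" head_dac_ip="${2:-198.51.100.1}"
  case "$role" in
    owner)  echo "http://localhost:8000/v1" ;;         # same host as the vLLM head
    client) echo "http://${head_dac_ip}:8000/v1" ;;    # reaches the head over the DAC
    *)      echo "unknown role: $role" >&2; return 1 ;;
  esac
}
```

A quick smoke test from either node could then be `curl -s "$(vllm_api_base client)/models"`, which should list the served model once the cluster is up.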

Network worksheet

Fill these in once. Every code block on this page that contains a matching placeholder (YOUR_NODE1_MGMT_IP, YOUR_USERNAME, etc.) will be live-substituted with the value you type — and a yellow highlight shows you what was filled in. Values are saved to your browser's localStorage so reloads keep them. Master keys, API keys, and other secrets are deliberately not in this worksheet — fill those into the relevant code blocks manually so they never touch localStorage.
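If you are working from a saved copy of a script and not the live page, the same substitution can be approximated offline. This is a minimal sketch, not the page's own mechanism: `fill_placeholders` replaces each NAME=value pair in text read from stdin, and `unfilled_placeholders` lists any `YOUR_*` tokens that remain. Both function names, and the sample value `alice`, are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: offline placeholder substitution for a saved script.

# Replace every occurrence of each NAME with its value.
# Usage: fill_placeholders NAME=value [NAME=value ...] < template
fill_placeholders() {
  local template pair name value
  template=$(cat)
  for pair in "$@"; do
    name=${pair%%=*}
    value=${pair#*=}
    template=${template//"$name"/"$value"}
  done
  printf '%s\n' "$template"
}

# List the distinct YOUR_* placeholders still present on stdin.
unfilled_placeholders() {
  { grep -oE 'YOUR_[A-Z0-9_]+' || true; } | sort -u
}
```

For example, `fill_placeholders YOUR_USERNAME=alice YOUR_NODE2_MGMT_IP=192.0.2.22 < setup-spark-01.sh | unfilled_placeholders` shows which values are still missing. Secrets are better left out of such a pipeline, for the same reason they are left out of the worksheet.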

Figure: network worksheet, your IP slots. Two boxes represent spark-01 (your node) and spark-02 (client node), each with three IP slots: mgmt (YOUR_NODE1_MGMT_IP, YOUR_NODE2_MGMT_IP), DAC (YOUR_NODE1_DAC_IP, YOUR_NODE2_DAC_IP), and Tailscale (YOUR_NODE1_TAILSCALE_IP, YOUR_NODE2_TAILSCALE_IP). The 200 Gb/s DAC link is drawn between the two nodes; the mgmt LAN and the two tailnets are shown as separate networks that each node attaches to. YOUR_TAILNET_HOSTNAME is your tailnet hostname, used in the n8n WEBHOOK_URL.

Secrets stay manual. YOUR_MASTER_KEY, YOUR_CLIENT_MASTER_KEY, and YOUR_BRAVE_API_KEY are intentionally not in this worksheet — fill those into the relevant code blocks by hand, and don't paste them into a browser-stored field. The worksheet only handles network identifiers and your username.

TL;DR — one-shot setup scripts

Fill the table below, then run the matching script on each node. The scripts bundle every step in this guide — packages, Docker, RoCE/DAC checks, vLLM cluster image, model download + DAC rsync, LiteLLM + Postgres + Prisma, Open WebUI, n8n, Hermes, Tailscale, and host firewall — into a single idempotent run per node. spark-01 must complete through the image-copy/rsync stage before spark-02 can finish; the spark-02 script will pause and wait for the model weights to arrive.

Secrets are session-only. Values entered below in the Secrets block (HF token, master keys, Tailscale auth keys) are kept in browser sessionStorage — they vanish when you close the tab and are never written to disk. Generate fresh master keys with openssl rand -hex 24.
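A small sketch of the key generation mentioned above, with a sanity check on the result. It assumes `openssl` is installed (the setup scripts install it) and falls back to /dev/urandom otherwise; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: generate a 24-byte master key as 48 hex characters.
gen_master_key() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 24
  else
    # Fallback (assumption): read 24 random bytes and hex-encode them.
    od -An -N24 -tx1 /dev/urandom | tr -d ' \n'
    echo
  fi
}

# Succeed only if $1 looks like a key produced by gen_master_key.
is_master_key() {
  [[ "$1" =~ ^[0-9a-f]{48}$ ]]
}
```

Generate one key per LiteLLM proxy and never reuse a key across the two nodes; `is_master_key` also rejects an unfilled placeholder such as YOUR_MASTER_KEY.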
Tailscale auth-key prerequisites. Before generating each auth key, in that tailnet's admin console:
  1. Open Access Controls and ensure the tag is declared in tagOwners: "tag:owner": ["you@example.com"] for spark-01, "tag:client-ai": ["autogroup:admin"] for spark-02. The tag must exist before any device tries to advertise it.
  2. Open Settings → Keys → Generate auth key. Toggle Tags on and select the matching tag from the list. Recommended: Reusable: no, Ephemeral: no, Tags: tag:owner (or tag:client-ai).
Auth keys without a tag selection will be rejected by --advertise-tags at tailscale up time. Tagged devices have key expiry automatically disabled, so the server won't drop off the tailnet on the 90-day timer.

Optional layers — uncheck to skip

Core stack — vLLM cluster, LiteLLM, Postgres, Docker, RoCE check — is always installed. Tailscale ships pre-installed on DGX Spark, so uncheck it if you've already configured it (or want to use your existing config). Unchecking a layer also removes its ufw port rule.
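To skip a layer in a saved copy of a script, the optional blocks can be removed by hand or with a filter. The sketch below assumes each optional block is delimited by a pair of comment markers of the form `# >>>OPT:name` and `# <<<OPT:name`; the closing form is an assumption here, as is the function name.

```shell
#!/usr/bin/env bash
# Sketch: drop one optional layer from a script read on stdin.
# Assumes paired markers "# >>>OPT:NAME" ... "# <<<OPT:NAME" on their own lines.
# Usage: strip_layer NAME < script.sh > script-without-layer.sh
strip_layer() {
  sed "/^# >>>OPT:$1\$/,/^# <<<OPT:$1\$/d"
}
```

For example, `strip_layer n8n < setup-spark-01.sh` would remove both the n8n install step and its ufw port rule, provided each is wrapped in its own marker pair.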

Network worksheet drives the IP/username substitutions above

Run order — start both scripts in parallel; the dependencies are baked in as wait loops on each side:

  1. Start setup-spark-02.sh first (or simultaneously). spark-01 cannot push the vLLM image until Docker is installed on spark-02; the spark-01 script will wait for it.
  2. Then start setup-spark-01.sh on spark-01. It installs packages, waits up to 5 min for spark-02's Docker to come online over the DAC, builds and pushes the ~19 GB vLLM image to spark-02, downloads the model on spark-01, rsyncs the ~60 GB of weights to spark-02 over the DAC, and brings up the rest of your stack.
  3. Meanwhile, setup-spark-02.sh waits up to 10 min for the image, then up to 30 min for the weights, then brings up the client LiteLLM, Open WebUI, n8n, Tailscale, and ufw on its side.
  4. When both scripts return, run hermes setup interactively on spark-01 — the wizard prompts for the Telegram bot token and user ID, so it can't be safely scripted.
  5. Apply the Tailscale ACLs from Step 06 in each owner's admin console.
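The wait loops in the run order above all follow one pattern: poll a command a bounded number of times, then give up. A minimal sketch of that pattern, with a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: retry a command up to TRIES times, sleeping DELAY seconds between tries.
# Usage: wait_for TRIES DELAY COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_for() {
  local tries="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= tries; i++ )); do
    "$@" >/dev/null 2>&1 && return 0
    sleep "$delay"
  done
  return 1
}
```

With this helper, a 5-minute cap polled every 5 seconds is `wait_for 60 5 some_check`, where `some_check` stands for whatever probe the step needs, such as an SSH command that tests for Docker on the other node.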
spark-01 — your node (head)
bash · setup-spark-01.sh · substituted live from worksheet
#!/usr/bin/env bash
# DGX AI Stack — spark-01 (owner / Ray head) one-shot setup
# Bundles: bootstrap · GUI disable · RoCE check · vLLM image build + DAC copy ·
# model download + rsync · LiteLLM + Postgres + Prisma · Open WebUI · n8n ·
# Hermes install · Hermes dashboard · Tailscale · ufw.
# Re-running is safe: every step is guarded.
set -euo pipefail

# ───── worksheet-substituted config ─────
NODE1_MGMT_IP="YOUR_NODE1_MGMT_IP"
NODE1_DAC_IP="YOUR_NODE1_DAC_IP"
NODE2_MGMT_IP="YOUR_NODE2_MGMT_IP"
NODE2_DAC_IP="YOUR_NODE2_DAC_IP"
USERNAME="YOUR_USERNAME"
TAILNET_HOSTNAME="YOUR_TAILNET_HOSTNAME"
HF_TOKEN="YOUR_HF_TOKEN"
MASTER_KEY="YOUR_MASTER_KEY"
TS_AUTHKEY="YOUR_NODE1_TS_AUTHKEY"
MODEL_RECIPE="MODEL_RECIPE"
SERVED_MODEL_NAME="SERVED_MODEL_NAME"
HF_MODEL_REPO="HF_MODEL_REPO"
HF_MODEL_DIR="HF_MODEL_DIR"

log() { echo -e "\n\033[1;34m▶ $*\033[0m"; }
die() { echo "✗ $*" >&2; exit 1; }

[[ "$EUID" -ne 0 ]] || die "do not run as root — run as $USERNAME"
[[ "$(whoami)" == "$USERNAME" ]] || die "expected user $USERNAME, got $(whoami)"

# Fail-fast if critical worksheet placeholders weren't filled.
[[ "$MASTER_KEY" != "YOUR_MASTER_KEY" && -n "$MASTER_KEY" ]] || die "MASTER_KEY not set — fill the TL;DR secrets block in the guide"
[[ "$HF_TOKEN"   != "YOUR_HF_TOKEN"   && -n "$HF_TOKEN"   ]] || die "HF_TOKEN not set — fill the TL;DR secrets block in the guide"
[[ "$NODE1_DAC_IP" != "YOUR_NODE1_DAC_IP" && "$NODE2_DAC_IP" != "YOUR_NODE2_DAC_IP" ]] || die "fill the DAC IPs in the network worksheet"
[[ "$USERNAME" != "YOUR_USERNAME" ]] || die "fill USERNAME in the network worksheet"

# ───── 1. base packages + docker ─────
log "apt — base packages"
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release git rsync ufw \
  nload jq python3-pip python3-venv openssh-client openssl

if ! command -v docker >/dev/null 2>&1; then
  curl -fsSL https://get.docker.com | sudo sh
fi
sudo systemctl enable --now docker
id -nG "$USERNAME" | grep -qw docker || sudo usermod -aG docker "$USERNAME"

# Re-exec under the docker group if this shell doesn't have it yet
if ! id -nG | grep -qw docker; then
  log "re-execing under docker group"
  SCRIPT="$(readlink -f "$0")"
  exec sg docker -c "bash '$SCRIPT' $*"
fi

# ───── 2. /etc/hosts — anchor hostnames to mgmt IPs ─────
log "/etc/hosts entries"
sudo sed -i '/[[:space:]]spark-01$/d;/[[:space:]]spark-02$/d' /etc/hosts
echo "$NODE1_MGMT_IP  spark-01" | sudo tee -a /etc/hosts >/dev/null
echo "$NODE2_MGMT_IP  spark-02" | sudo tee -a /etc/hosts >/dev/null

# ───── 3. headless: stop and disable GDM ─────
log "disable desktop"
sudo systemctl set-default multi-user.target
sudo systemctl stop gdm 2>/dev/null || true
sudo systemctl stop gnome-remote-desktop 2>/dev/null || true
sudo systemctl disable gnome-remote-desktop 2>/dev/null || true

# ───── 4. RoCE / DAC sanity ─────
log "RoCE check"
ibdev2netdev || die "ibdev2netdev failed — RoCE drivers missing"
ls /dev/infiniband/ >/dev/null || die "/dev/infiniband missing — fix RoCE plumbing first"

# ───── 5. passwordless SSH to spark-02 ─────
log "SSH access to spark-02 (mgmt + DAC)"
[[ -f "$HOME/.ssh/id_ed25519" ]] || ssh-keygen -t ed25519 -N "" -f "$HOME/.ssh/id_ed25519"

# Probe with BatchMode — succeeds silently if a key is already authorized.
probe_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
      "$USERNAME@$1" 'exit 0' >/dev/null 2>&1
}

# Deploy the key to one target, but only if probing fails.
# Reads password from /dev/tty so this works even when stdin is a pipe
# (e.g. `curl ... | bash`). Falls through with a warning if the user
# skips or if password auth is disabled on spark-02.
deploy_key() {
  local host="$1" label="$2"
  if probe_ssh "$host"; then
    echo "  ✓ $label ($host) — key already authorized, no password needed"
    return 0
  fi
  echo
  echo "  → $label ($host) — SSH key not yet authorized."
  echo "    ssh-copy-id will prompt for $USERNAME's password on spark-02."
  echo "    Press Ctrl-D to skip if password auth is disabled there (you will then"
  echo "    need to deploy ~/.ssh/id_ed25519.pub to spark-02 manually before re-running)."
  if [[ -r /dev/tty ]]; then
    ssh-copy-id -o StrictHostKeyChecking=accept-new "$USERNAME@$host" </dev/tty \
      || echo "  (ssh-copy-id failed or was skipped for $label)"
  else
    echo "  (no controlling terminal — cannot prompt for password; deploy key manually)"
  fi
  if probe_ssh "$host"; then
    echo "  ✓ $label ($host) — key now authorized"
  else
    echo "  ✗ $label ($host) — still no passwordless SSH (will retry in step 6)"
  fi
}

deploy_key "$NODE2_MGMT_IP" "spark-02 mgmt"
deploy_key "$NODE2_DAC_IP"  "spark-02 DAC"

# ───── 6. wait for spark-02 to have Docker (build-and-copy.sh needs `docker load` there) ─────
log "waiting for spark-02 Docker over DAC ($NODE2_DAC_IP)"
SSH2="ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new $USERNAME@$NODE2_DAC_IP"
for i in {1..60}; do  # 5 min cap
  $SSH2 'command -v docker' >/dev/null 2>&1 && break
  [[ $((i % 6)) -eq 0 ]] && echo "  …spark-02 not ready (${i}/60) — start setup-spark-02.sh there if you haven't"
  sleep 5
done
$SSH2 'command -v docker' >/dev/null 2>&1 || die "spark-02 unreachable or Docker missing — run setup-spark-02.sh on the other node first"

# ───── 7. vLLM image — clone + build + copy to spark-02 over DAC ─────
log "spark-vllm-docker"
cd "$HOME"
[[ -d spark-vllm-docker ]] || git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5 --copy-to "$NODE2_DAC_IP"

# ───── 8. autodiscovery → .env (accept defaults) ─────
log "discovery"
# Feed blank answers to all prompts; trailing `|| true` survives SIGPIPE under pipefail.
{ printf '\n%.0s' {1..20} | ./run-recipe.sh --discover; } || true
grep -q '^CONTAINER_HF_TOKEN=' .env || echo "CONTAINER_HF_TOKEN=$HF_TOKEN" >> .env

# ───── 9. model weights — download then rsync to spark-02 over DAC ─────
log "huggingface download"
pip3 install --break-system-packages 'huggingface_hub[cli]' >/dev/null
export PATH="$HOME/.local/bin:$PATH"  # so `hf` resolves on this run
HF_TOKEN="$HF_TOKEN" hf download "$HF_MODEL_REPO" \
  --local-dir "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR/"

log "rsync model → spark-02 over DAC ($NODE1_DAC_IP → $NODE2_DAC_IP)"
$SSH2 "mkdir -p ~/.cache/huggingface/hub/$HF_MODEL_DIR"
rsync -avP -e "ssh -b $NODE1_DAC_IP" \
  "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR/" \
  "$NODE2_DAC_IP:.cache/huggingface/hub/$HF_MODEL_DIR/"

# ───── 10. LiteLLM + Prisma ─────
log "pip — litellm[proxy] + prisma"
pip3 install --break-system-packages 'litellm[proxy]' prisma >/dev/null
grep -q 'HOME/.local/bin' "$HOME/.bashrc" || \
  echo 'export PATH="$HOME/.local/bin:$PATH"' >> "$HOME/.bashrc"

# ───── 11. spark-ai-stack — postgres compose ─────
log "spark-ai-stack directory + postgres"
mkdir -p "$HOME/spark-ai-stack/logs" "$HOME/workspace"
cd "$HOME/spark-ai-stack"

cat > docker-compose.yml <<'YAML'
services:
  litellm-db:
    image: postgres:16
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    volumes:
      - litellm_db:/var/lib/postgresql/data
    ports:
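      # Loopback only: LiteLLM reaches Postgres via localhost, and Docker-published ports bypass ufw.
      - "127.0.0.1:5432:5432"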
volumes:
  litellm_db:
YAML

docker compose up -d litellm-db
for i in {1..40}; do
  docker exec litellm-db pg_isready -U litellm >/dev/null 2>&1 && break
  sleep 2
done
docker exec litellm-db pg_isready -U litellm >/dev/null 2>&1 \
  || die "postgres (litellm-db) never became ready — check 'docker logs litellm-db'"

# ───── 12. litellm-config.yaml (owner) ─────
log "litellm-config.yaml"
cat > "$HOME/spark-ai-stack/litellm-config.yaml" <<EOF
model_list:
  - model_name: Qwen3.5-122B-Non-Reasoning
    litellm_params:
      model: openai/$SERVED_MODEL_NAME
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
      extra_body:
        chat_template_kwargs:
          enable_thinking: false
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 8192
  - model_name: Qwen3.5-122B-Reasoning
    litellm_params:
      model: openai/$SERVED_MODEL_NAME
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 32768
      extra_body:
        chat_template_kwargs:
          enable_thinking: true
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 32768
litellm_settings:
  verbose: true
  store_model_in_db: true
  default_system_message: "You are a highly capable AI assistant. Be direct, accurate, and concise. Answer immediately without preamble. For coding: produce complete, working code. State definitive answers first then explain."
router_settings:
  num_retries: 0
  timeout: 600
general_settings:
  master_key: $MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
  mcp_settings:
    allow_all_keys: true
EOF

log "prisma db push"
PRISMA_SCHEMA="$(python3 -c 'import litellm,os;print(os.path.join(os.path.dirname(litellm.__file__),"proxy","schema.prisma"))')"
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
  "$HOME/.local/bin/prisma" db push --schema "$PRISMA_SCHEMA"

# ───── 13. systemd — vllm-cluster + litellm ─────
log "systemd units"
sudo tee /etc/systemd/system/vllm-cluster.service >/dev/null <<EOF
[Unit]
Description=vLLM Cluster - $SERVED_MODEL_NAME
After=network-online.target docker.service
Wants=network-online.target
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
User=$USERNAME
WorkingDirectory=/home/$USERNAME/spark-vllm-docker
ExecStartPre=/bin/sleep 30
ExecStart=/home/$USERNAME/spark-vllm-docker/run-recipe.sh $MODEL_RECIPE -d -- --served-model-name $SERVED_MODEL_NAME --gpu-memory-utilization 0.80
TimeoutStartSec=600
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target
EOF

sudo tee /etc/systemd/system/litellm.service >/dev/null <<EOF
[Unit]
Description=LiteLLM Proxy (owner)
After=network.target docker.service
Wants=docker.service

[Service]
Type=simple
User=$USERNAME
WorkingDirectory=/home/$USERNAME/spark-ai-stack
ExecStart=/home/$USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable vllm-cluster litellm
# --no-block: oneshot units block on ExecStart; model load takes minutes.
sudo systemctl start --no-block vllm-cluster
sleep 10
sudo systemctl start litellm

# >>>OPT:openwebui
# ───── 14. Open WebUI ─────
log "open-webui container"
docker rm -f open-webui 2>/dev/null || true
docker run -d --name open-webui --restart unless-stopped \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
  -e OPENAI_API_KEY="$MASTER_KEY" \
  -e WEBUI_AUTH=True \
  -e ENABLE_OLLAMA_API=False \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main
# <<<OPT:openwebui
# >>>OPT:n8n
# ───── 15. n8n compose ─────
log "n8n compose"
# Fall back to plain 'spark-01' if the worksheet hostname is blank or unfilled.
WEBHOOK_HOST="$TAILNET_HOSTNAME"
[[ -z "$WEBHOOK_HOST" || "$WEBHOOK_HOST" == "YOUR_TAILNET_HOSTNAME" ]] && WEBHOOK_HOST=spark-01
cat > "$HOME/spark-ai-stack/n8n.yml" <<EOF
services:
  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=0.0.0.0
      - N8N_PORT=5678
      - N8N_PROTOCOL=http
      - WEBHOOK_URL=http://$WEBHOOK_HOST:5678/
      - N8N_SECURE_COOKIE=false
      - NODE_ENV=production
      - GENERIC_TIMEZONE=America/Los_Angeles
    volumes:
      - n8n_data:/home/node/.n8n
    extra_hosts:
      - "host.docker.internal:host-gateway"
volumes:
  n8n_data:
EOF
docker compose -f "$HOME/spark-ai-stack/n8n.yml" up -d
# <<<OPT:n8n
# >>>OPT:hermes
# ───── 16. Hermes Agent + Dashboard ─────
log "hermes install (run 'hermes setup' interactively after this script)"
sudo apt-get install -y ripgrep ffmpeg
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash || true

# Pre-create state dirs.
mkdir -p "$HOME/.hermes" "$HOME/workspace"

sudo tee /etc/systemd/system/hermes-dashboard.service >/dev/null <<EOF
[Unit]
Description=Hermes Agent Dashboard
After=network.target hermes-gateway.service
Wants=hermes-gateway.service

[Service]
Type=simple
User=$USERNAME
ExecStart=/home/$USERNAME/.local/bin/hermes dashboard --port 9119 --host 0.0.0.0 --insecure --no-open
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable hermes-dashboard
sudo systemctl start hermes-dashboard
# <<<OPT:hermes
# >>>OPT:tailscale
# ───── 17. Tailscale (owner tailnet) ─────
# Prereqs in the OWNER tailnet:
#   1. tag:owner must be declared in tagOwners in the policy file.
#   2. The auth key must be generated WITH tag:owner selected (Admin → Settings → Keys).
# DGX Spark may ship with Tailscale pre-installed and possibly pre-joined; install.sh
# is idempotent and upgrades in place, and `tailscale logout` ensures a clean re-join.
log "tailscale install + up"
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale logout 2>/dev/null || true
if [[ -n "$TS_AUTHKEY" && "$TS_AUTHKEY" != "YOUR_NODE1_TS_AUTHKEY" ]]; then
  sudo tailscale up --auth-key="$TS_AUTHKEY" --hostname=spark-01 --advertise-tags=tag:owner
else
  echo "  (TS_AUTHKEY blank — run: sudo tailscale up --hostname=spark-01 --advertise-tags=tag:owner)"
fi
# <<<OPT:tailscale
# >>>OPT:ufw
# ───── 18. host firewall ─────
log "ufw"
sudo ufw --force default deny incoming
sudo ufw default allow outgoing
# >>>OPT:tailscale
sudo ufw allow in on tailscale0 to any port 8001 proto tcp
# >>>OPT:openwebui
sudo ufw allow in on tailscale0 to any port 8080 proto tcp
# <<<OPT:openwebui
# >>>OPT:n8n
sudo ufw allow in on tailscale0 to any port 5678 proto tcp
# <<<OPT:n8n
# >>>OPT:hermes
sudo ufw allow in on tailscale0 to any port 9119 proto tcp
# <<
spark-02 — client node (worker)
bash · setup-spark-02.sh · substituted live from worksheet
#!/usr/bin/env bash
# DGX AI Stack — spark-02 (client / Ray worker) one-shot setup
# Bundles: bootstrap · GUI disable · RoCE check · wait-for-image · wait-for-weights ·
# client LiteLLM + Postgres + Prisma · client Open WebUI · client n8n ·
# Tailscale (client tailnet) · ufw.
# Re-running is safe: every step is guarded.
set -euo pipefail

# ───── worksheet-substituted config ─────
NODE1_MGMT_IP="YOUR_NODE1_MGMT_IP"
NODE1_DAC_IP="YOUR_NODE1_DAC_IP"
NODE2_MGMT_IP="YOUR_NODE2_MGMT_IP"
NODE2_DAC_IP="YOUR_NODE2_DAC_IP"
USERNAME="YOUR_USERNAME"
CLIENT_TAILNET_HOSTNAME="CLIENT_TAILNET_HOSTNAME"
CLIENT_MASTER_KEY="YOUR_CLIENT_MASTER_KEY"
TS_AUTHKEY="YOUR_NODE2_TS_AUTHKEY"
SERVED_MODEL_NAME="SERVED_MODEL_NAME"
HF_MODEL_DIR="HF_MODEL_DIR"

log() { echo -e "\n\033[1;35m▶ $*\033[0m"; }
die() { echo "✗ $*" >&2; exit 1; }

[[ "$EUID" -ne 0 ]] || die "do not run as root — run as $USERNAME"
[[ "$(whoami)" == "$USERNAME" ]] || die "expected user $USERNAME, got $(whoami)"

# Fail-fast if critical worksheet placeholders weren't filled.
[[ "$CLIENT_MASTER_KEY" != "YOUR_CLIENT_MASTER_KEY" && -n "$CLIENT_MASTER_KEY" ]] || die "CLIENT_MASTER_KEY not set — fill the TL;DR secrets block in the guide"
[[ "$NODE1_DAC_IP" != "YOUR_NODE1_DAC_IP" && "$NODE2_DAC_IP" != "YOUR_NODE2_DAC_IP" ]] || die "fill the DAC IPs in the network worksheet"
[[ "$USERNAME" != "YOUR_USERNAME" ]] || die "fill USERNAME in the network worksheet"

# ───── 1. base packages + docker ─────
log "apt — base packages"
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release git rsync ufw \
  nload jq python3-pip python3-venv openssh-server openssl netcat-openbsd

if ! command -v docker >/dev/null 2>&1; then
  curl -fsSL https://get.docker.com | sudo sh
fi
sudo systemctl enable --now docker ssh
id -nG "$USERNAME" | grep -qw docker || sudo usermod -aG docker "$USERNAME"

# Re-exec under the docker group if this shell doesn't have it yet
if ! id -nG | grep -qw docker; then
  log "re-execing under docker group"
  SCRIPT="$(readlink -f "$0")"
  exec sg docker -c "bash '$SCRIPT' $*"
fi

# ───── 2. /etc/hosts ─────
log "/etc/hosts"
sudo sed -i '/[[:space:]]spark-01$/d;/[[:space:]]spark-02$/d' /etc/hosts
echo "$NODE1_MGMT_IP  spark-01" | sudo tee -a /etc/hosts >/dev/null
echo "$NODE2_MGMT_IP  spark-02" | sudo tee -a /etc/hosts >/dev/null

# ───── 3. headless ─────
log "disable desktop"
sudo systemctl set-default multi-user.target
sudo systemctl stop gdm 2>/dev/null || true
sudo systemctl stop gnome-remote-desktop 2>/dev/null || true
sudo systemctl disable gnome-remote-desktop 2>/dev/null || true

# ───── 4. RoCE / DAC sanity ─────
log "RoCE check"
ibdev2netdev || die "ibdev2netdev failed — RoCE drivers missing"
ls /dev/infiniband/ >/dev/null || die "/dev/infiniband missing"

# ───── 5. wait for spark-01 to push the vLLM image (~19 GB over 200 Gb/s DAC ≈ 1–2 min) ─────
log "waiting for vllm-node-tf5 image (pushed from spark-01 by build-and-copy.sh)"
for i in {1..120}; do  # 120 × 5s = 10 min cap
  docker images --format '{{.Repository}}:{{.Tag}}' | grep -q '^vllm-node-tf5:latest$' && break
  [[ $((i % 12)) -eq 0 ]] && echo "  …still waiting (${i}/120)"
  sleep 5
done
docker images | grep -q vllm-node-tf5 || die "vllm-node-tf5 image never arrived from spark-01 — re-run build-and-copy.sh there"

# ───── 6. wait for model weights via rsync from spark-01 (~60 GB over DAC ≈ 4–8 min) ─────
log "waiting for HF model weights at ~/.cache/huggingface/hub/$HF_MODEL_DIR"
mkdir -p "$HOME/.cache/huggingface/hub"
for i in {1..180}; do  # 180 × 10s = 30 min cap
  if [[ -d "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" ]] && \
     [[ -n "$(ls -A "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null)" ]]; then
    # let rsync settle: same size across two samples 10s apart
    sz1=$(du -sb "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null | awk '{print $1}')
    sleep 10
    sz2=$(du -sb "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null | awk '{print $1}')
    [[ "$sz1" == "$sz2" && "$sz1" -gt 1000000000 ]] && break
  fi
  [[ $((i % 6)) -eq 0 ]] && echo "  …still waiting (${i}/180, size so far: $(du -sh "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null | awk '{print $1}'))"
  sleep 10
done
[[ -n "$(ls -A "$HOME/.cache/huggingface/hub/$HF_MODEL_DIR" 2>/dev/null)" ]] \
  || die "model weights never arrived from spark-01; re-run the rsync there"

# ───── 7. LiteLLM + Prisma ─────
log "pip — litellm[proxy] + prisma"
pip3 install --break-system-packages 'litellm[proxy]' prisma >/dev/null
grep -q 'HOME/.local/bin' "$HOME/.bashrc" || \
  echo 'export PATH="$HOME/.local/bin:$PATH"' >> "$HOME/.bashrc"
export PATH="$HOME/.local/bin:$PATH"

# ───── 8. spark-ai-stack + standalone postgres ─────
log "spark-ai-stack + postgres"
mkdir -p "$HOME/spark-ai-stack/logs"
cd "$HOME/spark-ai-stack"

docker rm -f litellm-db 2>/dev/null || true
docker run -d --name litellm-db --restart unless-stopped \
  -e POSTGRES_USER=litellm \
  -e POSTGRES_PASSWORD=litellm \
  -e POSTGRES_DB=litellm \
  -p 127.0.0.1:5432:5432 \
  -v litellm_db:/var/lib/postgresql/data \
  postgres:16

for i in {1..40}; do
  docker exec litellm-db pg_isready -U litellm >/dev/null 2>&1 && break
  sleep 2
done
docker exec litellm-db pg_isready -U litellm >/dev/null 2>&1 \
  || die "postgres (litellm-db) never became ready — check 'docker logs litellm-db'"

# ───── 9. client litellm-config.yaml (points DAC → vLLM) ─────
log "client litellm-config.yaml"
cat > "$HOME/spark-ai-stack/litellm-config.yaml" <<EOF
model_list:
  - model_name: Qwen3.5-122B-Non-Reasoning
    litellm_params:
      model: openai/$SERVED_MODEL_NAME
      api_base: http://$NODE1_DAC_IP:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
      extra_body:
        chat_template_kwargs:
          enable_thinking: false
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 8192
  - model_name: Qwen3.5-122B-Reasoning
    litellm_params:
      model: openai/$SERVED_MODEL_NAME
      api_base: http://$NODE1_DAC_IP:8000/v1
      api_key: "not-needed"
      max_tokens: 32768
      extra_body:
        chat_template_kwargs:
          enable_thinking: true
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 32768
litellm_settings:
  verbose: true
  store_model_in_db: true
  log_config:
    level: INFO
    format: json
    filepath: /home/$USERNAME/spark-ai-stack/logs/litellm.log
router_settings:
  num_retries: 0
  timeout: 600
general_settings:
  master_key: $CLIENT_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
EOF

log "prisma db push"
PRISMA_SCHEMA="$(python3 -c 'import litellm,os;print(os.path.join(os.path.dirname(litellm.__file__),"proxy","schema.prisma"))')"
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
  "$HOME/.local/bin/prisma" db push --schema "$PRISMA_SCHEMA"

# ───── 10. client litellm systemd ─────
log "systemd — litellm (client)"
sudo tee /etc/systemd/system/litellm.service >/dev/null <<EOF
[Unit]
Description=LiteLLM Proxy (client)
After=network.target docker.service
Wants=docker.service

[Service]
Type=simple
User=$USERNAME
WorkingDirectory=/home/$USERNAME/spark-ai-stack
ExecStart=/home/$USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now litellm

# >>>OPT:openwebui
# ───── 11. client Open WebUI ─────
log "client open-webui"
docker rm -f open-webui 2>/dev/null || true
docker run -d --name open-webui --restart unless-stopped \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
  -e OPENAI_API_KEY="$CLIENT_MASTER_KEY" \
  -e WEBUI_AUTH=True \
  -e ENABLE_OLLAMA_API=False \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main
# <<<OPT:openwebui
# >>>OPT:n8n
# ───── 12. client n8n ─────
log "client n8n"
# Fall back to plain 'spark-02' if the worksheet field is blank or unfilled.
WEBHOOK_HOST="$CLIENT_TAILNET_HOSTNAME"
[[ -z "$WEBHOOK_HOST" || "$WEBHOOK_HOST" == "CLIENT_TAILNET_HOSTNAME" ]] && WEBHOOK_HOST=spark-02
docker rm -f n8n 2>/dev/null || true
docker run -d --name n8n --restart unless-stopped \
  -p 5678:5678 \
  -e N8N_HOST=0.0.0.0 \
  -e N8N_PORT=5678 \
  -e N8N_PROTOCOL=http \
  -e WEBHOOK_URL="http://$WEBHOOK_HOST:5678/" \
  -e N8N_SECURE_COOKIE=false \
  -e NODE_ENV=production \
  -v n8n_data:/home/node/.n8n \
  --add-host=host.docker.internal:host-gateway \
  n8nio/n8n:latest
# <<<OPT:n8n
# >>>OPT:tailscale
# ───── 13. Tailscale (client tailnet) ─────
# Prereqs in the CLIENT tailnet:
#   1. tag:client-ai must be declared in tagOwners in the client's policy file.
#   2. The auth key must be generated WITH tag:client-ai selected (Admin → Settings → Keys).
# DGX Spark may ship with Tailscale pre-installed and possibly pre-joined; install.sh
# is idempotent and upgrades in place, and `tailscale logout` ensures a clean re-join.
log "tailscale install + up (client tailnet)"
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale logout 2>/dev/null || true
if [[ -n "$TS_AUTHKEY" && "$TS_AUTHKEY" != "YOUR_NODE2_TS_AUTHKEY" ]]; then
  sudo tailscale up --auth-key="$TS_AUTHKEY" --hostname=spark-02 --advertise-tags=tag:client-ai
else
  echo "  (TS_AUTHKEY blank — run: sudo tailscale up --hostname=spark-02 --advertise-tags=tag:client-ai)"
fi
# <<<OPT:tailscale
# >>>OPT:ufw
# ───── 14. host firewall ─────
log "ufw"
sudo ufw --force default deny incoming
sudo ufw default allow outgoing
# >>>OPT:tailscale
sudo ufw allow in on tailscale0 to any port 8001 proto tcp
# >>>OPT:openwebui
sudo ufw allow in on tailscale0 to any port 8080 proto tcp
# <<<OPT:openwebui
# >>>OPT:n8n
sudo ufw allow in on tailscale0 to any port 5678 proto tcp
# <<
Curl-from-gist pattern (optional). If you'd rather not scp the scripts, click download, push the file to a private gist or your own host, then on each node: curl -fsSL https://<your-host>/setup-spark-01.sh | bash. The downloaded scripts have the worksheet values baked in, so anyone who can read the URL can also read your secrets — use a one-time gist or self-host behind auth, and rotate the master keys after the run.

Prerequisites

  • Two Nvidia DGX Spark nodes — Grace CPU, GB10 GPU, arm64/aarch64, each running Ubuntu 24.04
  • Each node referred to as spark-01 (your node) and spark-02 (client node) — substitute your own hostnames
  • Both parties have read and accepted the Trust model section above
  • Docker installed and enabled on both nodes: sudo systemctl enable docker
  • Your Linux user added to the docker group on both nodes: sudo usermod -aG docker YOUR_USERNAME && newgrp docker
  • 200GbE DAC cable between the two nodes (included in the dual-node DGX Spark bundle) — interface enp1s0f0np0, MTU 9216, point-to-point /30. NCCL is configured to use this interface for all tensor parallel all-reduce communication.
  • Mgmt LAN reachability between both nodes (1 GbE RJ45 with default routes)
  • Passwordless SSH both directions (spark-01 ↔ spark-02) — required for the HF cache rsync in Step 01
  • Mgmt-IP entries in /etc/hosts on both nodes so hostnames resolve to mgmt addresses, not the DAC IP (commands below)
  • Replace YOUR_USERNAME with your Linux username throughout
  • Replace YOUR_NODE1_MGMT_IP / YOUR_NODE2_MGMT_IP with each node's mgmt IP, and YOUR_NODE1_DAC_IP / YOUR_NODE2_DAC_IP with each node's DAC IP

Bootstrap on both nodes — /etc/hosts and docker group

By default, the hostname of each node resolves to its DAC IP (198.51.100.x), not the mgmt IP. SSH from one node to the other by hostname will fail until you anchor the hostnames to mgmt IPs explicitly.

#### Run on both spark-01 AND spark-02

bash
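# Add mgmt-IP entries for both nodes (drop stale entries first so re-runs stay clean)
sudo sed -i '/[[:space:]]spark-01$/d;/[[:space:]]spark-02$/d' /etc/hosts
echo "YOUR_NODE1_MGMT_IP  spark-01" | sudo tee -a /etc/hosts
echo "YOUR_NODE2_MGMT_IP  spark-02" | sudo tee -a /etc/hosts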

# Add your user to the docker group (then re-login or use newgrp)
sudo usermod -aG docker YOUR_USERNAME
newgrp docker

# Verify SSH by hostname both directions
ssh spark-01 hostname    # from spark-02
ssh spark-02 hostname    # from spark-01
If you skip the /etc/hosts step, the rsync of the Hugging Face cache between nodes (Step 01) and any later ssh spark-0X command will silently target the DAC interface — which won't have sshd bound to it unless you've changed defaults. The symptom is a "connection refused" or hang.
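
A quick way to confirm the anchoring took effect is to read back what the hosts file maps each name to. The sketch below is illustrative only: `hosts_ip_for` is a hypothetical helper, not part of any tool used in this guide, and it inspects a hosts-format file rather than consulting DNS.

```bash
#!/usr/bin/env bash
# hosts_ip_for FILE NAME
# Print the first IP address that a hosts-format FILE maps to NAME.
# Hypothetical helper for illustration; it skips whole-line comments and
# prints nothing when the name is absent.
hosts_ip_for() {
  local file="$1" name="$2"
  awk -v n="$name" '
    /^[[:space:]]*#/ { next }
    { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
  ' "$file"
}

# Example: show what each node name currently maps to on this machine.
for host in spark-01 spark-02; do
  echo "$host -> $(hosts_ip_for /etc/hosts "$host")"
done
```

Each line should show the node's mgmt address, not a 198.51.100.x DAC address. An empty result means the entry is missing.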

Disable the desktop environment — both nodes

The DGX Spark ships with Ubuntu Desktop. Both nodes operate as headless servers with no monitor connected — stop and disable the display stack before running any workloads. This frees GPU memory and eliminates background display scheduling noise. The multi-user target persists across reboots. To start the GUI temporarily if needed: sudo systemctl start graphical.target

Pre-step — verify SSH access to both nodes before disabling the GUI

Run from your Mac or any external machine on the same network. Both commands must return successfully before continuing. If either fails, resolve SSH access before proceeding — once the display manager is stopped you will have no local GUI fallback.

bash
ssh YOUR_USERNAME@YOUR_NODE1_MGMT_IP "echo spark-01 SSH OK"
ssh YOUR_USERNAME@YOUR_NODE2_MGMT_IP "echo spark-02 SSH OK"

#### Run on BOTH spark-01 and spark-02

bash
sudo systemctl stop gdm
sudo systemctl set-default multi-user.target
sudo systemctl stop gnome-remote-desktop
sudo systemctl disable gnome-remote-desktop

# Verify — should return no output
ps aux | grep -E "Xorg|gnome" | grep -v grep
STEP 01

vLLM clustered — TP=2 over Ray on RoCE/RDMA

vLLM is the only clustered service. The model runs with tensor-parallel size 2: spark-01 hosts the Ray master and the vLLM head process; spark-02 hosts a Ray worker. NCCL collectives flow over RoCE/RDMA on the DAC link — not TCP sockets. The earlier hand-built vllm-spark:26.04 approach is obsolete: it ran NCCL over TCP/IP (no /dev/infiniband passthrough, no NCCL_IB_HCA), so steady-state throughput sat at 2–3 tok/s instead of the 45 tok/s the hardware is capable of. The community-maintained eugr/spark-vllm-docker stack ships pre-built SM121a (Blackwell) wheels and wires up the RoCE/infiniband passthrough correctly. Use it.

Production model

| Track | Model | Notes |
| --- | --- | --- |
| Production (default) | Qwen/Qwen3.5-122B-A10B-FP8 | The intended daily driver: 122B / A10B MoE, official Qwen FP8. ~61 GB resident per node. Includes an MTP head (qwen3_next_mtp), but speculative decoding is disabled because it is unstable in vLLM v0.19.0 (HTTP 500 on requests with standard sampling params). Steady-state throughput ~45 tok/s at TP=2 without MTP. 262K context window. Recipe: qwen3.5-122b-fp8. |
| Bootstrap fallback | Qwen/Qwen3.6-35B-A3B-FP8 | 35B MoE / A3B activation, FP8 quantized. Useful for fast iteration on cluster wiring (Ray, NCCL, RoCE) before committing to the longer 122B load. |
| Tested but does not fit / not supported | Qwen/Qwen3-235B-A22B-FP8 · Sehyo/Qwen3.5-122B-A10B-NVFP4 | 235B FP8 ≈ 117.5 GB per node, which leaves no room for KV cache; Ray OOMs. NVFP4 is single-node only on DGX Spark today; multi-node NVFP4 fails at cluster launch. See "Other known issues". |

Step 1a — Verify RoCE interfaces on both nodes

The DAC link presents two RoCE devices per port-twin. Only the active port (port 0 on each card) is used; the second port stays Down on a standard DAC pair.

#### Run on BOTH spark-01 and spark-02

bash
ibdev2netdev
ls /dev/infiniband/
Expected from ibdev2netdev on each node:
rocep1s0f0 → enp1s0f0np0 (Up)  ← active, port 0
roceP2p1s0f0 → enP2p1s0f0np0 (Up)  ← active, port 0 twin
rocep1s0f1 → enp1s0f1np1 (Down)  ← DAC only uses port 0
roceP2p1s0f1 → enP2p1s0f1np1 (Down)
Expected from ls /dev/infiniband/: rdma_cm umad0 umad1 umad2 umad3 uverbs0 uverbs1 uverbs2 uverbs3
If /dev/infiniband is missing or either port-0 device is Down, fix the RoCE plumbing on the host before continuing — RDMA passthrough into the container can only work if the devices are present and Up. A working NCCL RDMA path gives ~45 tok/s; TCP fallback gives ~2–3 tok/s.
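
The port-0 check can be scripted so it fails loudly instead of relying on a visual scan. This is a minimal sketch under stated assumptions: `roce_devices_up` is a hypothetical helper, and because the exact `ibdev2netdev` line layout varies between driver versions, it only relies on the device name starting the line and `(Up)` appearing on the same line.

```bash
#!/usr/bin/env bash
# roce_devices_up DEVICE...
# Read an ibdev2netdev-style report on stdin and succeed only if every named
# device has a line that starts with its name and reports "(Up)".
# Hypothetical helper; the match is deliberately loose about the middle of
# the line because the exact layout differs between driver versions.
roce_devices_up() {
  local report dev
  report="$(cat)"
  for dev in "$@"; do
    if ! grep -Eq "^${dev}[[:space:]].*\(Up\)" <<<"$report"; then
      echo "not Up (or missing): $dev" >&2
      return 1
    fi
  done
}

# Intended use on each node (commented out so the sketch runs anywhere):
# ibdev2netdev | roce_devices_up rocep1s0f0 roceP2p1s0f0 || echo "fix RoCE first"
```

The device names passed in are the two port-0 twins listed above. The match is case-sensitive, so `rocep1s0f0` and `roceP2p1s0f0` are checked independently.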

Step 1b — Clone and build the image (spark-01)

The build pulls the pre-built SM121a (Blackwell) wheels — no compilation required — and the helper script auto-copies the resulting image to spark-02 over the DAC. The --tf5 flag is required for the container variant used by the production recipe. Total build time ~30 minutes (mostly image pull + cross-node copy).

bash
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5 --copy-to YOUR_NODE2_DAC_IP
After the script finishes, docker images | grep vllm-node-tf5 on both nodes should show vllm-node-tf5:latest at ~19 GB.

Step 1c — Autodiscovery (spark-01)

The discovery step detects the local and peer DAC IPs, identifies the RoCE twins for NCCL_IB_HCA, and writes the values to .env. Every subsequent run-recipe.sh invocation reads from this file — get it right once and the cluster launch becomes a one-liner.

bash
cd ~/spark-vllm-docker
./run-recipe.sh --discover

Accept the prompts. After you append the Hugging Face token (next command), the .env file should contain:

env
CLUSTER_NODES=YOUR_NODE1_DAC_IP,YOUR_NODE2_DAC_IP
COPY_HOSTS=YOUR_NODE2_DAC_IP
LOCAL_IP=YOUR_NODE1_DAC_IP
ETH_IF=enp1s0f0np0
IB_IF=rocep1s0f0,roceP2p1s0f0
CONTAINER_HF_TOKEN=<your_hf_token>

Append the Hugging Face token after discovery — the discovery prompt won't ask for it:

bash
echo "CONTAINER_HF_TOKEN=YOUR_HF_TOKEN" >> ~/spark-vllm-docker/.env

Step 1d — Download the model and rsync to spark-02 over the DAC

Both nodes need the model weights resident locally. Pull on spark-01 first, then rsync to spark-02 over the DAC — keeps the ~60 GB transfer off the mgmt LAN.

#### spark-01

bash
# Run as YOUR_USERNAME — not root
HF_TOKEN=YOUR_HF_TOKEN hf download Qwen/Qwen3.5-122B-A10B-FP8 \
  --local-dir ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-FP8/

# Bind explicitly to the local DAC IP so the rsync runs over enp1s0f0np0
rsync -avP \
  -e "ssh -b YOUR_NODE1_DAC_IP" \
  ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-FP8/ \
  YOUR_NODE2_DAC_IP:~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-FP8/
Verify the transfer ran across the DAC with nload enp1s0f0np0 on spark-02 in another shell during the rsync.
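
`nload` confirms which interface carried the traffic, but not that the copy is complete. One hedged way to check completeness is to total the bytes of regular files under the model directory on each node and compare the two numbers. `dir_file_bytes` below is a hypothetical helper, and it assumes GNU `find`, which Ubuntu ships.

```bash
#!/usr/bin/env bash
# dir_file_bytes DIR
# Sum the sizes of all regular files under DIR, in bytes. Symlinks are not
# followed or counted, so a tree with symlinked entries is measured by its
# real files only. Prints 0 for an empty or missing directory.
# Requires GNU find (-printf).
dir_file_bytes() {
  find "$1" -type f -printf '%s\n' 2>/dev/null \
    | awk '{ s += $1 } END { printf "%.0f\n", s }'
}

# Intended use: run on spark-01 and on spark-02, then compare the two totals.
# dir_file_bytes "$HOME/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-FP8"
```

Matching totals are a cheap sanity check, not a checksum. If you want stronger assurance, re-running the same rsync command should report nothing left to transfer.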

Step 1e — Launch the cluster

The launch script reads .env, starts the head container on spark-01, SSHes into spark-02 to start the worker container, and forms the Ray cluster. Both containers are started from the same image, so Ray versions are guaranteed to match.

The production recipe is at ~/spark-vllm-docker/recipes/qwen3.5-122b-fp8.yaml. Key parameters baked into the recipe:

yaml
model: Qwen/Qwen3.5-122B-A10B-FP8
container: vllm-node-tf5
max_model_len: 262144
max_num_batched_tokens: 8192
mods: mods/fix-qwen3.5-chat-template
env:
  HF_HUB_OFFLINE: 1
  TRANSFORMERS_OFFLINE: 1

These vars prevent vLLM from attempting to reach huggingface.co on every container start. The model weights are already present in the local HF cache — DNS failures after a power outage or network interruption will not block startup.

Additional flags passed to vllm serve by the recipe:

text
--load-format fastsafetensors --enable-prefix-caching --enable-auto-tool-choice \
--tool-call-parser qwen3_coder --reasoning-parser qwen3 --chat-template unsloth.jinja \
-tp 2 --distributed-executor-backend ray \
--max-num-batched-tokens 8192 \
--default-chat-template-kwargs '{"enable_thinking": false}'

Launch with gpu_memory_utilization passed as a CLI override, which takes precedence over the recipe default:

bash
cd ~/spark-vllm-docker
./run-recipe.sh qwen3.5-122b-fp8 -d -- \
  --served-model-name qwen3.5-122b \
  --gpu-memory-utilization 0.80

The script starts the Ray head on spark-01, then SSHes into spark-02 to start the worker. It waits until both GPUs register with the Ray cluster before starting vllm serve. Do not interrupt between head start and cluster formation — if you need to restart, stop both containers first (docker stop vllm_node on each node), then re-run the launch command on spark-01.

What run-recipe.sh does automatically:

  • Sets NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 so NCCL pins to both RoCE twins.
  • Passes /dev/infiniband devices into both containers (RDMA verbs + CM).
  • Forms the Ray cluster (head on spark-01, worker on spark-02) and waits until both GPUs are registered before starting vllm serve.
  • Applies the mods/fix-qwen3.5-chat-template mod and uses fastsafetensors for fast loader I/O.
⚠️
MTP speculative decoding — do not enable on this stack

The FP8 model includes a Multi-Token Prediction (MTP) head (qwen3_next_mtp), and vLLM supports it via --speculative-config. Do not enable it.

As of vLLM v0.19.0, MTP speculative decoding is actively unstable on Qwen3.5-class models:
  • Clients sending min_p or logit_bias sampling parameters (default behavior in Open WebUI, Hermes, and most frontends) receive a hard HTTP 500 — "The min_p and logit_bias sampling parameters are not yet supported with speculative decoding" — breaking all inference across every client simultaneously.
  • Tool calls fail or produce malformed output every 3–4 calls under qwen3_next_mtp. (vllm-project/vllm#35800)
  • Long-sequence requests crash with illegal memory access.
  • Generation quality degrades across multi-turn sessions, collapsing to 0% draft acceptance rate.

vLLM's own documentation acknowledges speculative decoding is not yet optimized for all sampling parameters. Omit --speculative-config entirely until upstream fixes these issues. Steady-state throughput without MTP is ~45 tok/s on this hardware — the benefit does not justify the breakage.

Watch the launch:

bash
docker logs -f vllm_node

Step 1f — Verification

#### spark-01 — confirm NCCL is using RDMA, not TCP sockets

bash
docker logs vllm_node | grep -E "NET/IB|NET/Socket"
Good: lines like NCCL INFO NET/IB : Using [0]rocep1s0f0:1/IB [1]roceP2p1s0f0:1/IB — NCCL is on RDMA.
Bad: NCCL INFO NET/Socket : Using … — NCCL fell back to TCP; throughput will be ~2–3 tok/s. See Cluster issues.
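
The Good/Bad distinction above can be turned into a one-word answer for use in scripts or health checks. A minimal sketch, assuming the log lines look like the two samples shown; NCCL's exact wording can differ between versions, so treat `unknown` as "go and read the log". `nccl_transport` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# nccl_transport
# Classify NCCL's network transport from log text on stdin.
# Prints "tcp" if a "NET/Socket : Using" line is present, "rdma" if only a
# "NET/IB : Using" line is present, and "unknown" otherwise.
nccl_transport() {
  local log
  log="$(cat)"
  if grep -Eq 'NET/Socket[[:space:]]*:[[:space:]]*Using' <<<"$log"; then
    echo tcp
  elif grep -Eq 'NET/IB[[:space:]]*:[[:space:]]*Using' <<<"$log"; then
    echo rdma
  else
    echo unknown
  fi
}

# Intended use on spark-01 (2>&1 so lines the container wrote to stderr are included):
# docker logs vllm_node 2>&1 | nccl_transport
```

The socket check runs first on purpose: any sign of TCP fallback is reported as `tcp`, even if an IB line is also present.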

#### spark-01 — Ray cluster status

bash
docker exec vllm_node ray status
Expected: 2 nodes total, 2.0/2.0 GPU, both DAC IPs listed (YOUR_NODE1_DAC_IP and YOUR_NODE2_DAC_IP).

#### spark-01 — model endpoint

bash
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3.5-122b","messages":[{"role":"user","content":"hi"}],"max_tokens":16}'
Expected: {"data":[{"id":"qwen3.5-122b",...}]} on the first call. The first completion takes ~20s while CUDA graphs warm up; subsequent completions stream at ~45 tok/s.
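
Because the endpoint only answers after model load and warmup, a bounded polling loop is more convenient than re-running curl by hand. The `retry` function below is a generic, hypothetical helper and not part of the eugr stack. The 60 × 10 s budget in the usage line is an arbitrary choice, not a measured load time.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between tries. Returns 0 on the first success, 1 otherwise.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Intended use on spark-01: poll /v1/models for up to 10 minutes.
# retry 60 10 curl -fsS -o /dev/null http://localhost:8000/v1/models \
#   && echo "vLLM is answering on :8000"
```

`curl -f` makes HTTP errors count as failures, so the loop keeps waiting while the server is up but not yet serving.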

#### Both nodes — GPU residency (GB10 quirk)

GB10 uses unified memory. The standard --query-gpu=memory.used,memory.total fields return [N/A] on this hardware — expected. Use plain nvidia-smi and read the Processes section:

bash
nvidia-smi   # run on each node

You should see a vllm / RayWorkerWrapper process on each node with roughly ~61 GB resident at the FP8 weight footprint — leaving headroom for KV cache at gpu_memory_utilization=0.80.

Step 1g — Systemd auto-start

Run only on spark-01 — the cluster launcher SSHes into spark-02 to bring up the worker. The 30-second ExecStartPre sleep gives spark-02 time to finish booting and have Docker running before the head SSHes in.

bash
sudo tee /etc/systemd/system/vllm-cluster.service << 'EOF'
[Unit]
Description=vLLM Cluster - Qwen3.5-122B FP8
After=network-online.target docker.service
Wants=network-online.target
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/spark-vllm-docker
ExecStartPre=/bin/sleep 30
ExecStart=/home/YOUR_USERNAME/spark-vllm-docker/run-recipe.sh qwen3.5-122b-fp8 -d -- \
  --served-model-name qwen3.5-122b \
  --gpu-memory-utilization 0.80
TimeoutStartSec=300
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable vllm-cluster.service
After a reboot of both nodes, systemctl status vllm-cluster on spark-01 should show active (exited) and curl http://localhost:8000/v1/models should return the model list once warmup completes.

To manage the cluster after initial setup:

bash
# Start
sudo systemctl start vllm-cluster.service

# Stop
sudo systemctl stop vllm-cluster.service

# Restart (spark-02 worker restarts automatically via launch-cluster.sh)
sudo systemctl restart vllm-cluster.service

# Logs
sudo journalctl -u vllm-cluster.service -f
docker logs -f vllm_node

Performance results

| Metric | Value | Notes |
| --- | --- | --- |
| Hardware | 2× NVIDIA DGX Spark (GB10, 128 GB unified memory each) | Total 256 GB unified pool |
| Model | Qwen/Qwen3.5-122B-A10B-FP8 | 122B / A10B MoE, FP8 (official Qwen) |
| Network | 200 Gb/s DAC (QSFP) · NCCL over RoCE/RDMA | NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 |
| Tensor parallel | TP=2 across both nodes | Ray-backed distributed executor |
| First request (CUDA-graph warmup) | ~20 s | One-time cost per cold start |
| Steady-state throughput | ~45 tok/s | Single-stream decode |
| Context window | 262,144 tokens | Full Qwen3.5 context retained |
| Memory per node | ~61 GB resident | FP8 weights; leaves remaining unified memory for KV cache at gpu_memory_utilization=0.80 |
| Speculative decoding | Disabled | MTP (qwen3_next_mtp) is not stable in vLLM v0.19.0 and causes HTTP 500 on requests with standard sampling params. See Step 1e warning. |
| Previous (hand-built, NCCL over TCP) | ~2–3 tok/s | Same hardware, wrong transport |
| Improvement | ~18× | From correct NCCL RoCE configuration alone |

The performance ceiling is GB10 LPDDR5X memory bandwidth (273 GB/s per node). With 10B active parameters at FP8, each decoded token touches roughly 10 GB of weights; TP=2 splits that across the two nodes, so each node reads ≈5 GB per token. The theoretical maximum is therefore 273 ÷ 5 ≈ 55 tok/s. 45 tok/s is ~82% of that memory-bandwidth ceiling, so there is little performance left on the table.
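
The ceiling estimate is a single division, and it is worth being able to redo it when the model or quantization changes. The sketch below takes the two figures from the paragraph above (273 GB/s of bandwidth, ≈5 GB of weight reads per token) as inputs. The function names are hypothetical, and the model ignores KV-cache reads and interconnect time, so it is an upper bound only.

```bash
#!/usr/bin/env bash
# decode_ceiling BANDWIDTH_GB_PER_S WEIGHT_GB_PER_TOKEN
# Upper bound on decode rate when every token must stream the active
# weights from memory once: tokens/s = bandwidth / bytes read per token.
decode_ceiling() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

# percent_of_ceiling MEASURED_TOK_PER_S CEILING_TOK_PER_S
# Express a measured rate as a whole-number percentage of the ceiling.
percent_of_ceiling() {
  awk -v m="$1" -v c="$2" 'BEGIN { printf "%.0f\n", 100 * m / c }'
}

ceiling="$(decode_ceiling 273 5)"   # 273 GB/s, ~5 GB of weight reads per token
echo "ceiling: ${ceiling} tok/s"
echo "45 tok/s is $(percent_of_ceiling 45 "$ceiling")% of that"
```

With these inputs the script prints a ceiling of 54.6 tok/s and 82%, matching the rounded figures in the text.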

Bootstrap fallback — the 35B FP8 model

The 35B FP8 model is still useful for fast iteration on cluster wiring (Ray, NCCL, RoCE) before committing to the longer 122B load. The eugr stack ships a recipe for it; if you've populated the cache, fall back with:

bash
cd ~/spark-vllm-docker
./run-recipe.sh qwen3.6-35b-fp8 -d -- \
  --served-model-name qwen3.6-35b \
  --default-chat-template-kwargs '{"enable_thinking": false}'

Update --served-model-name in the LiteLLM config in Step 02 (and the client's LiteLLM in Step 03) to match if you fall back.
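
In the LiteLLM configs the served name appears after `openai/` in each `model:` line, so falling back means editing that value in every entry. A minimal sketch of doing that in place, assuming GNU sed; `swap_served_model` is a hypothetical helper. The old name is used as a regular expression, so the `.` in `qwen3.5-122b` matches any character, which is harmless here.

```bash
#!/usr/bin/env bash
# swap_served_model CONFIG OLD_NAME NEW_NAME
# Rewrite every "model: openai/OLD_NAME" line in a LiteLLM config to use
# NEW_NAME, keeping a .bak copy of the original next to it.
# Names must not contain '#', which is used as the sed delimiter.
swap_served_model() {
  local config="$1" old="$2" new="$3"
  sed -i.bak -E \
    "s#^([[:space:]]*model:[[:space:]]*openai/)${old}[[:space:]]*\$#\1${new}#" \
    "$config"
}

# Intended use on either node, followed by a proxy restart:
# swap_served_model ~/spark-ai-stack/litellm-config.yaml qwen3.5-122b qwen3.6-35b
# sudo systemctl restart litellm
```

Only the backend name changes. The `model_name` aliases that clients select are left as they are, so you may want to rename them by hand to avoid advertising a 122B label for a 35B model.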

STEP 02

Your LiteLLM proxy on spark-01

This is your LiteLLM proxy — your master key, your SQLite log corpus, your routing rules. It serves only your application stack on spark-01 (your Open WebUI, your Hermes, your n8n). The client gets their own separate LiteLLM in Step 03.

Your LiteLLM lives on the same node as the vLLM head and points at localhost:8000. The clustered vLLM presents one logical OpenAI-compatible endpoint — LiteLLM doesn't need to know there are two physical nodes behind it.

LiteLLM has no arm64 Docker image — install via pip directly on the host. This is unchanged from a single-node setup.

#### spark-01 — directories and install

bash
mkdir -p ~/spark-ai-stack/logs
cd ~/spark-ai-stack

pip3 install 'litellm[proxy]' --break-system-packages
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
Verify: ~/.local/bin/litellm --version

#### spark-01 — config (single backend pointing at the local clustered vLLM)

bash
cat > ~/spark-ai-stack/litellm-config.yaml << 'EOF'
model_list:
  # Default — thinking off. Used by all clients unless they explicitly select the reasoning model.
  - model_name: Qwen3.5-122B-Non-Reasoning
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
      extra_body:
        chat_template_kwargs:
          enable_thinking: false
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 8192

  # Opt-in reasoning — user selects this model explicitly for complex tasks.
  - model_name: Qwen3.5-122B-Reasoning
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 32768
      extra_body:
        chat_template_kwargs:
          enable_thinking: true
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 32768

litellm_settings:
  verbose: true
  store_model_in_db: true
  default_system_message: "You are a highly capable AI assistant. Be direct,
    accurate, and concise. Answer immediately without preamble. Never deliberate
    out loud about whether or how to answer. For coding: produce complete,
    working code. State definitive answers first then explain."

router_settings:
  num_retries: 0
  timeout: 600

general_settings:
  master_key: YOUR_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
  mcp_settings:
    allow_all_keys: true
EOF
Two model entries point at the same vLLM backend. Thinking is disabled by default — vLLM disables it at the container level via --default-chat-template-kwargs, and LiteLLM reinforces this per-entry via extra_body. Users select Qwen3.5-122B-Reasoning in Open WebUI, n8n, or any frontend when they need extended reasoning. All other requests use Qwen3.5-122B-Non-Reasoning.
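
Requests that go through LiteLLM on port 8001 select between the two entries by putting the alias in the `model` field, not the vLLM served name. The sketch below builds a minimal request body for either alias. `chat_payload` is a hypothetical helper, and the prompt is fixed to a JSON-safe string so that plain `printf` is enough.

```bash
#!/usr/bin/env bash
# chat_payload MODEL_ALIAS MAX_TOKENS
# Build a minimal chat-completions body. The prompt is a fixed, JSON-safe
# string; use jq to build the body if the prompt comes from user input.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"hi"}],"max_tokens":%d}\n' "$1" "$2"
}

chat_payload Qwen3.5-122B-Non-Reasoning 16
chat_payload Qwen3.5-122B-Reasoning 256

# Sending one of them through your proxy (YOUR_MASTER_KEY is the placeholder
# from the config above):
# curl http://localhost:8001/v1/chat/completions \
#   -H "Authorization: Bearer YOUR_MASTER_KEY" \
#   -H 'Content-Type: application/json' \
#   -d "$(chat_payload Qwen3.5-122B-Reasoning 256)"
```

The larger `max_tokens` on the reasoning request is only an illustration; thinking output consumes tokens before the answer starts, so very small limits tend to cut the response off.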

#### spark-01 — systemd service

bash
sudo tee /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy
After=network.target docker.service
Wants=docker.service

[Service]
Type=simple
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/spark-ai-stack
ExecStart=/home/YOUR_USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm
Verify from spark-01: curl http://localhost:8001/v1/models (no key needed if you didn't set master_key yet, otherwise use -H "Authorization: Bearer YOUR_MASTER_KEY")
Your LiteLLM should NOT be reachable from spark-02. If you are using Tailscale ACLs (Step 06), only your tailnet should reach spark-01:8001. The client uses their own LiteLLM (Step 03) — they never call yours.
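
Reachability is easy to test from the other side instead of assuming it. A minimal sketch, run from spark-02, that probes a TCP port with bash's built-in `/dev/tcp` redirection; `port_open` is a hypothetical helper, and the 3-second default timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# port_open HOST PORT [TIMEOUT_SECONDS]
# Succeed if a TCP connection to HOST:PORT can be opened within the timeout.
# Uses bash's built-in /dev/tcp redirection, so nothing is needed beyond
# bash itself and coreutils' timeout.
port_open() {
  local host="$1" port="$2" secs="${3:-3}"
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Intended use on spark-02:
# port_open YOUR_NODE1_DAC_IP 8000 && echo "shared vLLM reachable (expected)"
# port_open YOUR_NODE1_DAC_IP 8001 && echo "WARNING: spark-01 LiteLLM reachable over the DAC"
```

Because the service above listens on 0.0.0.0, the second probe will succeed unless a host firewall rule on spark-01 blocks port 8001 on the DAC interface. Run the same probe against the mgmt IP as well.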

Step 2d — PostgreSQL for the Admin UI and virtual keys (required)

The LiteLLM Admin UI and virtual key system require PostgreSQL — this is why the step exists. LiteLLM's Prisma schema is hardcoded for PostgreSQL, and the Admin UI, virtual key generation, and model management all write to the PostgreSQL metadata database.

#### spark-01 — bring up the litellm-db postgres container

On spark-01 the postgres container lives in ~/spark-ai-stack/docker-compose.yml alongside n8n:

yaml
services:
  litellm-db:
    image: postgres:16
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    volumes:
      - litellm_db:/var/lib/postgresql/data
    ports:
      - "127.0.0.1:5432:5432"

volumes:
  litellm_db:
bash
cd ~/spark-ai-stack
docker compose up -d litellm-db
docker compose ps litellm-db

#### spark-01 — apply the Prisma schema

Install the Prisma CLI if missing, then push the LiteLLM Prisma schema into the new database. This must run once on spark-01 before LiteLLM starts, otherwise the UI will return table public.LiteLLM_UserTable does not exist.

bash
pip install prisma --break-system-packages

DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
  prisma db push \
  --schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma
Expected: Your database is now in sync with your Prisma schema. Done in <Ns>
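
The `--schema` path above hardcodes `python3.12`, which stops matching if the system Python version changes. One hedged alternative is to locate the file instead of spelling the path out. `find_litellm_schema` is a hypothetical helper; it assumes a per-user pip install under `~/.local/lib` (as in the path above) and GNU `find`.

```bash
#!/usr/bin/env bash
# find_litellm_schema [SEARCH_ROOT]
# Print the path of the first litellm/proxy/schema.prisma found under
# SEARCH_ROOT (default: ~/.local/lib, where per-user pip installs land).
# Prints nothing if no schema is found. Requires GNU find (-quit).
find_litellm_schema() {
  local root="${1:-$HOME/.local/lib}"
  find "$root" -type f \
    -path '*/site-packages/litellm/proxy/schema.prisma' -print -quit 2>/dev/null
}

# Intended use:
# schema="$(find_litellm_schema)"
# [ -n "$schema" ] || { echo "schema.prisma not found" >&2; exit 1; }
# DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
#   prisma db push --schema "$schema"
```

If more than one Python version has LiteLLM installed, the first match wins, so check the printed path before pushing.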

#### spark-01 — wire the database into litellm-config.yaml

Add the database_url to general_settings (not at the top level — see troubleshooting) and enable model-in-DB storage so the UI can edit the model list:

yaml
general_settings:
  master_key: YOUR_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"

litellm_settings:
  store_model_in_db: true

Restart and verify:

bash
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
curl -s http://localhost:8001/health/readiness | head
UI is at http://spark-01:8001/ui — log in with admin and your master key. Generate per-service virtual keys from the Virtual Keys tab (one for Open WebUI, one for n8n, one for Hermes — never paste the master key into a downstream service).
STEP 03

Client LiteLLM proxy on spark-02

The client gets their own LiteLLM proxy on spark-02, with their own master key, their own log corpus, and their own routing rules. It points at the shared vLLM endpoint over the DAC link. This is not a copy of spark-01's LiteLLM — it has no shared config, no shared key, no shared logs. The client controls their own master key and never shares it with you.

If you are setting up spark-02 on behalf of the client, hand off the master-key generation step (or have them rotate the key the moment they take over). The point of split trust is that you do not hold the client's API credentials.

#### spark-02 — install

bash
mkdir -p ~/spark-ai-stack/logs
cd ~/spark-ai-stack

pip3 install 'litellm[proxy]' --break-system-packages
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc

#### spark-02 — generate the client master key

Run on the client's terminal — store this key only on spark-02:

bash
echo "sk-client-$(openssl rand -hex 16)"

#### spark-02 — config (points at vLLM over the DAC)

bash
cat > ~/spark-ai-stack/litellm-config.yaml << 'EOF'
model_list:
  - model_name: Qwen3.5-122B-Non-Reasoning
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://YOUR_NODE1_DAC_IP:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
      extra_body:
        chat_template_kwargs:
          enable_thinking: false
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 8192

  - model_name: Qwen3.5-122B-Reasoning
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://YOUR_NODE1_DAC_IP:8000/v1
      api_key: "not-needed"
      max_tokens: 32768
      extra_body:
        chat_template_kwargs:
          enable_thinking: true
    model_info:
      supports_function_calling: true
      supports_tool_choice: true
      max_context_window: 262144
      max_input_tokens: 229376
      max_output_tokens: 32768

litellm_settings:
  verbose: true
  database:
    type: sqlite
    path: /home/YOUR_USERNAME/spark-ai-stack/logs/litellm.db
  log_config:
    level: INFO
    format: json
    filepath: /home/YOUR_USERNAME/spark-ai-stack/logs/litellm.log

general_settings:
  master_key: YOUR_CLIENT_MASTER_KEY    # set to the sk-client-... value above

router_settings:
  num_retries: 0
  timeout: 600
EOF
The DAC is the fastest path between the two nodes — substantially lower latency than going over the mgmt LAN. Do not point the client LiteLLM at YOUR_NODE1_MGMT_IP:8000 unless the DAC is down.

#### spark-02 — systemd service (independent of spark-01)

bash
sudo tee /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy (client)
After=network.target docker.service
Wants=docker.service

[Service]
Type=simple
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/spark-ai-stack
ExecStart=/home/YOUR_USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm
sudo systemctl status litellm --no-pager

Verification — confirm the request hits spark-01:8000

#### spark-02 — local LiteLLM responds with the client key

bash
curl http://localhost:8001/v1/models \
  -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY"

#### spark-02 — inference round-trip through the shared backend

bash
curl http://localhost:8001/v1/chat/completions \
  -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen3.5-122B-Non-Reasoning","messages":[{"role":"user","content":"hi from client"}],"max_tokens":16}'
On spark-01, tail -f ~/spark-ai-stack/logs/litellm.log shows nothing — your LiteLLM is not in the path. Instead, run docker logs --tail 20 vllm_node on spark-01 — you should see the new request reach the vLLM head.

Step 3d — PostgreSQL for the client Admin UI and virtual keys (required)

The LiteLLM Admin UI and virtual key system require PostgreSQL — same requirement as Step 02. Set up PostgreSQL on spark-02 so the client's Admin UI, virtual key generation, and model management work correctly. spark-02 doesn't run a docker-compose stack, so we use a standalone postgres container.

#### spark-02 — standalone postgres container

bash
docker run -d --name litellm-db --restart unless-stopped \
  -e POSTGRES_USER=litellm \
  -e POSTGRES_PASSWORD=litellm \
  -e POSTGRES_DB=litellm \
  -p 127.0.0.1:5432:5432 \
  -v litellm_db:/var/lib/postgresql/data \
  postgres:16

#### spark-02 — apply the Prisma schema

bash
pip install prisma --break-system-packages

DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
  prisma db push \
  --schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma
Expected: Your database is now in sync with your Prisma schema.

#### spark-02 — wire the database into the client litellm-config.yaml

database_url and store_model_in_db both go under general_settings, alongside the existing master_key:

yaml
general_settings:
  master_key: YOUR_CLIENT_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
  store_model_in_db: true
bash
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
Client UI is at http://spark-02:8001/ui — log in with admin and the client master key. The client generates their own per-app virtual keys from the Virtual Keys tab; you never see them.
If a virtual key was generated before the schema was fully applied (e.g. you generated a key, then re-ran prisma db push), the old key will appear in the UI but lookups will fail with Virtual key not found in LiteLLM_VerificationTokenTable. Delete it in the UI, restart LiteLLM, then generate a new one.
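
Virtual keys can also be created programmatically rather than through the Virtual Keys tab. The sketch below only builds the request and does not send it. It assumes LiteLLM's /key/generate endpoint with the key_alias and models fields, and that the response carries the new key under "key"; the alias open-webui is illustrative. Check these against the LiteLLM version installed on spark-02.

```python
import json
import urllib.request


def build_key_request(proxy_url, master_key, key_alias, models):
    """Build (but do not send) a request for a new virtual key."""
    payload = {"key_alias": key_alias, "models": models}
    return urllib.request.Request(
        url=proxy_url.rstrip("/") + "/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_key(response_body):
    """Pull the generated key out of the JSON response (assumed field: 'key')."""
    return json.loads(response_body)["key"]


demo_request = build_key_request(
    "http://localhost:8001",
    "YOUR_CLIENT_MASTER_KEY",
    "open-webui",  # hypothetical per-app alias
    ["Qwen3.5-122B-Non-Reasoning"],
)
print(demo_request.get_method(), demo_request.full_url)

# To send it on spark-02:
#   with urllib.request.urlopen(demo_request, timeout=30) as resp:
#       print(extract_key(resp.read().decode("utf-8")))
```

Scoping each key to a single app and a single model alias keeps the per-service audit trail readable in the client's own logs.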
STEP 04

Your Open WebUI on spark-01

Your daily-driver chat interface, owned by you, on your node. It points at your LiteLLM on port 8001: http://localhost:8001/v1 on the host, which the container reaches as http://host.docker.internal:8001/v1. The client gets their own Open WebUI on spark-02 in Step 05, and neither side can see the other's chat history, knowledge bases, RAG documents, or API keys.

#### spark-01 — directory and run

bash
mkdir -p ~/spark-ai-stack
cd ~/spark-ai-stack

docker run -d \
  --name open-webui \
  --restart unless-stopped \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
  -e OPENAI_API_KEY="YOUR_MASTER_KEY" \
  -e WEBUI_AUTH=True \
  -e ENABLE_OLLAMA_API=False \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main

Visit http://localhost:8080 from spark-01 (or via your tailnet — see Step 06), create your admin account, and confirm in Settings → Connections → OpenAI API:

| Setting | Value |
| --- | --- |
| API Base URL | http://host.docker.internal:8001/v1 |
| API Key | your master_key |
| Default model | Qwen3.5-122B-Non-Reasoning |
| Memory | Toggle ON (Settings → Personalization) |

Send a chat message — it should stream back via your LiteLLM (Step 02) → vLLM head (Step 01).
STEP 05

Client Open WebUI on spark-02

The client's daily-driver chat interface, owned by the client, on the client's node. It points at the client's LiteLLM at http://localhost:8001/v1 — which in turn calls the shared vLLM head on spark-01:8000 over the DAC.

The client's data lives on the client's node. Their Open WebUI database, knowledge bases, RAG document store, embedding indexes, conversation history, attached files, and account list — all of it is in the open-webui Docker volume on spark-02. None of it is replicated to spark-01. If you spin spark-02 down, the client's UI state goes with it; if you image spark-01, the client's state is not in your image.

#### spark-02 — directory and run

bash
mkdir -p ~/spark-ai-stack
cd ~/spark-ai-stack

docker run -d \
  --name open-webui \
  --restart unless-stopped \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
  -e OPENAI_API_KEY="YOUR_CLIENT_MASTER_KEY" \
  -e WEBUI_AUTH=True \
  -e ENABLE_OLLAMA_API=False \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main

Visit http://localhost:8080 from spark-02 (or via the client's tailnet — see Step 06), create the client's admin account, and confirm:

| Setting | Value |
| --- | --- |
| API Base URL | http://host.docker.internal:8001/v1 (client's LiteLLM) |
| API Key | client's master_key |
| Default model | Qwen3.5-122B-Non-Reasoning |

Send a chat message — completions stream back through the client's LiteLLM (Step 03) and the shared vLLM head on spark-01 (Step 01). On spark-01, your LiteLLM logs show nothing — the client's traffic does not enter your stack.

Recommended Open WebUI system prompt

Set this once in Settings → General → System Prompt. It pairs with --default-chat-template-kwargs '{"enable_thinking": false}' on the vLLM head (Step 01): the model answers directly by default, and users can opt into extended reasoning per-message by prefixing the prompt with /think. Apply the same prompt on both Open WebUIs (yours on spark-01 and the client's on spark-02) — they share the underlying model.

text
You are a highly capable AI assistant. Be direct, accurate, and concise.

Rules:
- Answer immediately without preamble or meta-commentary
- Never deliberate out loud about whether or how to answer — just answer
- Never question the framing of a hypothetical — engage with it directly
- For technical questions: be precise, use correct terminology
- For coding: produce complete, working code — no placeholders or omissions
- For reasoning: show your work clearly but efficiently — no repetition
- If a question has a definitive answer, state it first then explain
- Match response length to question complexity

Step 5b — All client services on spark-02 at a glance

Three client-facing services run on spark-02: LiteLLM (Step 03), Open WebUI (Step 05 above), and n8n. All three are reachable on the client's tailnet via tag:client-ai (see Step 06). The n8n container below is not covered elsewhere — bring it up after the client's LiteLLM is healthy:

#### spark-02 — n8n container

bash
docker run -d --name n8n --restart unless-stopped \
  -p 5678:5678 \
  -e N8N_HOST=0.0.0.0 \
  -e N8N_PORT=5678 \
  -e N8N_PROTOCOL=http \
  -e WEBHOOK_URL=http://spark-02:5678/ \
  -e N8N_SECURE_COOKIE=false \
  -e NODE_ENV=production \
  -v n8n_data:/home/node/.n8n \
  --add-host=host.docker.internal:host-gateway \
  n8nio/n8n:latest

| Service | Port | How it runs | Backend / api_base | Auth secret |
| --- | --- | --- | --- | --- |
| LiteLLM | 8001 | systemd (same structure as spark-01) | http://YOUR_NODE1_DAC_IP:8000/v1 (DAC link) | Client master key at ~/.spark02-litellm-key (chmod 600) |
| Open WebUI | 8080 | docker run --restart unless-stopped | OPENAI_API_BASE_URL=http://host.docker.internal:8001/v1 | Client master key (or virtual key) |
| n8n | 5678 | docker run --restart unless-stopped | WEBHOOK_URL=http://spark-02:5678/ | n8n owner account (set on first login) |

All three services are reachable from the client's tailnet via tag:client-ai on TCP 8001 / 8080 / 5678 (see the ACL grants in Step 06). They are not reachable from your tailnet — split-trust by construction.
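Store the client master key at ~/.spark02-litellm-key with chmod 600. Reference it from the LiteLLM systemd unit via EnvironmentFile= rather than embedding it in litellm-config.yaml, so that the config file on disk does not contain the secret. With that approach the key file holds a single NAME=value line (for example LITELLM_MASTER_KEY=sk-client-...) and the config reads it as master_key: os.environ/LITELLM_MASTER_KEY.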
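
A key file with the right permissions can be created in one step. This is a minimal sketch, assuming the key is stored as a NAME=value line so systemd's EnvironmentFile= can read it; the variable name LITELLM_MASTER_KEY and the helper write_key_file are assumptions for illustration. The demo writes to a throwaway directory; on spark-02 the path would be ~/.spark02-litellm-key.

```python
import os
import secrets
import stat
import tempfile


def write_key_file(path, prefix="sk-client-", var_name="LITELLM_MASTER_KEY"):
    """Generate a random key and store it in a file only the owner can read."""
    key = prefix + secrets.token_urlsafe(32)
    # Mode 0o600 is applied at creation time, so the secret is never
    # briefly world-readable. O_EXCL refuses to overwrite an existing file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{var_name}={key}\n")
    return key


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "spark02-litellm-key")
    demo_key = write_key_file(demo_path)
    demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
    with open(demo_path) as handle:
        demo_line = handle.read()
    print(oct(demo_mode))
```

Creating the file with restrictive permissions up front avoids the window that exists when a file is written first and chmod-ed afterwards.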
STEP 06

Tailscale (both nodes, separate tailnets)

Each node joins its owner's tailnet independently. Two separate tailnets, two separate ACL policies, two separate sets of users. The DAC link (198.51.100.0/30) is private physical hardware between the two nodes — it is not advertised onto either tailnet, and it is not used for any cross-tailnet routing.

Security-critical: vLLM on spark-01:8000 has no authentication. Once Tailscale is configured, ensure your Tailscale ACLs do not expose port 8000 to the client's tailnet (and the client's ACLs do not expose your node's 8000 port to anyone either). The client must only reach spark-02:8001 (their own LiteLLM). If they can reach spark-01:8000 directly, they bypass their LiteLLM entirely and have unauthenticated inference access — which also means no key-scoped logging, no rate limit, and no audit trail.

spark-01 — your tailnet

#### spark-01 — install + join

bash
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=spark-01 --advertise-tags=tag:owner
tailscale ip -4   # note this address — your apps will be reachable here

In your tailnet's ACL policy (Tailscale admin console), expose your Open WebUI, your LiteLLM, and your other apps only to your own users. Example ACL fragment:

json
{
  "acls": [
    { "action": "accept",
      "src":    ["group:owner-users"],
      "dst":    ["tag:owner:8080", "tag:owner:8001", "tag:owner:5678", "tag:owner:9119"]
    },
    { "action": "accept",
      "src":    ["group:owner-users"],
      "dst":    ["tag:owner:22"]
    }
  ],
  "tagOwners": {
    "tag:owner": ["YOU@example.com"]
  },
  "groups": {
    "group:owner-users": ["YOU@example.com"]
  }
}

Do NOT add tag:owner:8000 to any allow rule. Port 8000 (vLLM) is unauthenticated and must remain reachable only via localhost (your LiteLLM in Step 02) and via the DAC address 198.51.100.1, where the only permitted peer is 198.51.100.2 (the client's LiteLLM in Step 03).
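
The port 8000 rule can be checked mechanically before a policy is pasted into the admin console. This is a rough sketch that handles only the "acls" format shown above (not "grants"), and it expects plain JSON without the comments Tailscale's policy editor allows. The function names are illustrative.

```python
import json

FORBIDDEN_PORT = 8000


def port_spec_covers(spec, port):
    """True if a port spec such as '*', '8080', '80,443' or '8000-8010' includes port."""
    for part in spec.split(","):
        part = part.strip()
        if part == "*":
            return True
        low, _, high = part.partition("-")
        high = high or low
        if low.isdigit() and high.isdigit() and int(low) <= port <= int(high):
            return True
    return False


def rules_exposing(policy, tag, port=FORBIDDEN_PORT):
    """Return every accept-rule 'dst' entry that opens `port` on `tag`."""
    hits = []
    for rule in policy.get("acls", []):
        if rule.get("action") != "accept":
            continue
        for dst in rule.get("dst", []):
            host, _, ports = dst.rpartition(":")
            if host in (tag, "*") and port_spec_covers(ports, port):
                hits.append(dst)
    return hits


OWNER_POLICY = json.loads("""
{
  "acls": [
    {"action": "accept", "src": ["group:owner-users"],
     "dst": ["tag:owner:8080", "tag:owner:8001", "tag:owner:5678", "tag:owner:9119"]},
    {"action": "accept", "src": ["group:owner-users"], "dst": ["tag:owner:22"]}
  ]
}
""")

# An empty list means no rule exposes the vLLM port.
print(rules_exposing(OWNER_POLICY, "tag:owner"))
```

Wildcards and ranges are the usual way port 8000 slips through, which is why the check looks at more than a literal ":8000" suffix.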

spark-02 — client's tailnet

#### spark-02 — install + join (client's auth key)

bash
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up \
  --authkey=<client-auth-key> \
  --hostname=spark-02 \
  --advertise-tags=tag:client-ai

tailscale ip -4   # client's apps will be reachable here, only on their tailnet

The client's tailnet ACLs are theirs to author. Mirror the structure above with their own users and tags. The client should expose :8080 (their Open WebUI), :8001 (their LiteLLM, only if they want programmatic access from elsewhere), and :5678 (their n8n).

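Disable key expiry for spark-02 in the client's Tailscale admin (Machines → spark-02 → ⋯ → Disable key expiry). Server nodes shouldn't drop off the tailnet when the node key expires (180 days by default); re-authentication should be a deliberate action. Tagged devices usually have key expiry disabled already, so this is often just a check that the setting is in place.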

#### Client tailnet ACL — replace the default allow-all

The default Tailscale policy allows everything between everyone. Replace it with this grants-based policy. It leaves all non-server devices unrestricted (so the client's existing fleet is untouched), keeps tag:client-server open (their existing Dell server stays as-is), and restricts tag:client-ai (this node) to only the three service ports — 8001 (LiteLLM), 8080 (Open WebUI), 5678 (n8n).

json
{
  "grants": [
    {
      "src": ["*"],
      "dst": ["autogroup:member"],
      "ip": ["*"]
    },
    {
      "src": ["*"],
      "dst": ["tag:client-server"],
      "ip": ["*"]
    },
    {
      "src": ["*"],
      "dst": ["tag:client-ai"],
      "ip": ["tcp:8001", "tcp:8080", "tcp:5678"]
    }
  ],
  "tagOwners": {
    "tag:client-ai":         ["autogroup:admin"],
    "tag:client-server":["autogroup:admin"]
  }
}
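After applying, from a client device on the same tailnet: nc -zv spark-02 8001, nc -zv spark-02 8080, and nc -zv spark-02 5678 all succeed. nc -zv spark-02 22 (SSH) fails, which shows the ACL is in effect. nc -zv spark-02 8000 also fails, but that alone proves little: spark-02 has no listener on 8000 (the vLLM head is on spark-01), so treat it as a sanity check rather than an ACL test.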
tag:client-ai is intentionally not given tcp:8000. Port 8000 is the raw, unauthenticated vLLM head on spark-01 reached over the DAC; it must never be exposed onto the client's tailnet. The client only ever calls their own LiteLLM on spark-02:8001.
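
The nc checks above can be bundled into one pass/fail probe. This is a small sketch using only the standard library; the function names are illustrative, and the demo uses a stand-in probe so it runs without network access.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def mismatches(host, expect_open, expect_closed, probe=port_open):
    """Return the ports whose observed reachability contradicts the policy."""
    wrong = [p for p in expect_open if not probe(host, p)]
    wrong += [p for p in expect_closed if probe(host, p)]
    return wrong


def fake_probe(host, port):
    """Stand-in for port_open that mimics a correctly applied ACL."""
    return port in {8001, 8080, 5678}


demo_result = mismatches("spark-02", [8001, 8080, 5678], [22, 8000], probe=fake_probe)
print(demo_result)  # an empty list means reachability matches the policy
```

From a real client device, drop the probe argument: mismatches("spark-02", [8001, 8080, 5678], [22, 8000]).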

Lock down host firewalls

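Tailscale ACLs are policy; the host firewall is enforcement. Apply ufw rules on both nodes so that even if Tailscale is misconfigured, ports 8000 and 8001 cannot leak to the wrong network. One caveat: ports published with docker run -p (Open WebUI on 8080, n8n on 5678) are wired up by Docker's own iptables rules and are not filtered by ufw by default. For those containers the ufw rules below are not a complete backstop unless you bind the published port to a specific address or filter in the DOCKER-USER chain.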

#### spark-01 — host firewall

bash
# Allow from your tailnet (interface tailscale0) and DAC peer only
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow in on tailscale0 to any port 8001 proto tcp           # your LiteLLM
sudo ufw allow in on tailscale0 to any port 8080 proto tcp           # your Open WebUI
sudo ufw allow in on tailscale0 to any port 5678 proto tcp           # your n8n
sudo ufw allow in on tailscale0 to any port 9119 proto tcp           # your Hermes dashboard
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 8000 proto tcp   # client LiteLLM → vLLM only
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 6379 proto tcp   # Ray GCS over DAC
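# Ray and NCCL also use other TCP ports between the nodes (node manager, object
# manager, worker ports, NCCL bootstrap). If the worker on spark-02 fails to join
# after ufw is enabled, allow those ports from YOUR_NODE2_DAC_IP as well.
sudo ufw allow ssh                                                   # ssh (allowed on every interface, not only the mgmt LAN)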
sudo ufw enable

#### spark-02 — host firewall

bash
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow in on tailscale0 to any port 8001 proto tcp           # client LiteLLM
sudo ufw allow in on tailscale0 to any port 8080 proto tcp           # client Open WebUI
sudo ufw allow in on tailscale0 to any port 5678 proto tcp           # client n8n
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE1_DAC_IP                  # NCCL/Ray over DAC
sudo ufw allow ssh
sudo ufw enable
From the client's tailnet, nc -zv spark-01-tailscale-ip 8000 should fail, typically with a timeout: spark-01 is not on the client's tailnet, and ufw drops the packet rather than refusing it. From the client's tailnet, curl http://spark-02-tailscale-ip:8001/v1/models -H "Authorization: Bearer CLIENT_KEY" should succeed.
STEP 07

Your Hermes Agent on spark-01

Hermes is your autonomous agent layer (skills, memory, cron, gateways). It runs on your node and talks to your LiteLLM. The client does not get a Hermes — they have their own Open WebUI and n8n on spark-02; if they want an agent they install their own.

Not Ollama. The official NVIDIA DGX reference playbook for Hermes assumes Ollama at localhost:11434. This stack does not run Ollama — Hermes points at LiteLLM at localhost:8001 as a custom OpenAI-compatible endpoint. Ignore any Ollama references in the NVIDIA guide; the wizard answers in this step are the correct ones for this stack.

#### spark-01 — install

Pre-install note: The Hermes installer will prompt to install ripgrep (fast file search, used by agent tools) and ffmpeg (required for TTS/voice message features) via apt. Both are safe to accept. If you are running unattended (e.g. piped from curl without a TTY), the installer skips the sudo prompt and logs a warning — install them manually first to avoid the fallback:
bash
sudo apt install -y ripgrep ffmpeg
bash
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
hermes --version
The installer provisions its own isolated runtime — no manual Python or Node.js setup required:

  • Python 3.11: managed by uv (Astral). Virtual env at ~/.hermes/hermes-agent/venv/. uv itself installs to ~/.local/bin/uv and ~/.cargo/bin/uv.
  • Node.js 22: downloaded to ~/.hermes/node/ with symlinks in ~/.local/bin/. Powers browser automation tools (Playwright-based).
  • hermes CLI: symlinked to ~/.local/bin/hermes. The PATH line added to ~/.bashrc by the installer covers this; source it or open a new shell before running hermes setup.

Submodules initialized automatically:
  • mini-swe-agent: terminal/shell tool backend. Gives the agent the ability to run arbitrary shell commands on spark-01. Review what workspace access you grant in hermes setup, and prefer limiting it to ~/workspace.
  • tinker-atropos: RL-based skill self-improvement backend. Lets Hermes refine its own skill files after completing complex tasks.

#### spark-01 — setup wizard

bash
hermes setup

| Prompt | Answer |
| --- | --- |
| Setup type | Full setup |
| Provider | Custom endpoint |
| API base URL | http://localhost:8001/v1 |
| API key | A virtual key scoped to Hermes from the LiteLLM Admin UI (recommended; gives a per-service audit trail). The master key also works but grants unrestricted access. |
| Model | Qwen3.5-122B-Non-Reasoning (the LiteLLM model alias; Hermes calls LiteLLM, which maps this to the vLLM backend) |
| Terminal backend | Local, with the workspace set to ~/workspace (limits shell tool scope to a safe directory) |
| Session reset mode | Inactivity + daily reset |
| Search provider | Skip (or Brave Search if configured; see Brave Search MCP section) |
| Launch chat now? | n |

The wizard writes ~/.hermes/config.yaml. base_url is local — Hermes and your LiteLLM both live on spark-01.

Tip — Customize agent tone: Edit ~/.hermes/SOUL.md to define Hermes' personality. This file is re-read on every message — changes take effect immediately without restarting the gateway. Leave it empty to use the default personality.

Example for a terse technical assistant:
You are a concise, direct technical assistant. No filler. No preamble. State the answer first, then explain if needed. Use correct terminology.

#### Hermes file locations

| Path | Contents |
| --- | --- |
| ~/.hermes/config.yaml | Main config: model, endpoint, providers |
| ~/.hermes/.env | API keys and gateway tokens |
| ~/.hermes/SOUL.md | Agent persona/tone; hot-reloaded each message, no restart needed |
| ~/.hermes/skills/ | Persistent skill documents (agentskills.io format); bundled skills seeded here automatically on install |
| ~/.hermes/memories/ | Long-term memory store |
| ~/.hermes/sessions/ | Conversation session state |
| ~/.hermes/logs/ | Gateway and agent logs |
| ~/.hermes/cron/ | Scheduled task definitions |
| ~/.hermes/hermes-agent/ | Cloned repo + venv (managed; do not edit directly) |
| ~/.hermes/node/ | Node.js 22 runtime (managed; do not edit directly) |

Telegram gateway

Create a bot via @BotFather (/newbot, copy the token) and get your user ID from @userinfobot. Then:

bash
hermes setup gateway   # select Telegram, paste token, paste user ID, choose System service

sudo /home/YOUR_USERNAME/.local/bin/hermes gateway install --system
sudo systemctl start hermes-gateway
sudo systemctl status hermes-gateway --no-pager
The full path is required because sudo does not inherit your user's $PATH. Alternatively, hermes gateway install (without sudo) installs a user-scope service instead of a system service — both work, but the system service starts on boot without a logged-in user session.

Built-in dashboard

Hermes ships a built-in web dashboard for managing config, API keys, and sessions. Run it as a systemd service so it starts automatically alongside the gateway.

#### spark-01 — create the dashboard service

bash
sudo tee /etc/systemd/system/hermes-dashboard.service << 'EOF'
[Unit]
Description=Hermes Agent Dashboard
After=network.target hermes-gateway.service
Wants=hermes-gateway.service

[Service]
Type=simple
User=YOUR_USERNAME
ExecStart=/home/YOUR_USERNAME/.local/bin/hermes dashboard --port 9119 --host 0.0.0.0 --insecure --no-open
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable hermes-dashboard
sudo systemctl start hermes-dashboard
sudo systemctl status hermes-dashboard --no-pager

The --insecure flag is required to bind to 0.0.0.0 instead of localhost. Protect port 9119 with Tailscale ACLs (Step 06) — it should only be reachable from your tailnet, not the open internet.

Dashboard is at http://YOUR_NODE1_TAILSCALE_IP:9119 from any device on your tailnet.
Dashboard crash-loop: Node.js build required. hermes-dashboard.service will crash-loop silently (restart counter >3600) on a fresh install because the dashboard frontend has never been compiled. A system-wide Node.js (also used for things like npx MCP servers) is not enough on its own: the frontend has to be built from within the Hermes install path, as the commands below do.

Run this after a fresh install and again after every hermes update:
bash · spark-01
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
cd ~/.hermes/hermes-agent/web && npm install && npm run build
sudo systemctl restart hermes-dashboard
sudo systemctl status hermes-dashboard --no-pager
Verify the dashboard responds: curl -s http://127.0.0.1:9119/api/status

Dashboard — basic auth (required for remote access)

Binding the dashboard to 0.0.0.0 exposes it to the local network. Add basic auth so it requires a username and password even inside the tailnet. Set these in ~/.hermes/.env on spark-01 (not in config.yaml):

bash · spark-01 — ~/.hermes/.env
HERMES_DASHBOARD_BASIC_AUTH_USERNAME=YOUR_DASHBOARD_USERNAME
HERMES_DASHBOARD_BASIC_AUTH_PASSWORD=YOUR_DASHBOARD_PASSWORD

Restart the dashboard after setting these. Verify auth is active:

bash
sudo systemctl restart hermes-dashboard
curl -s http://127.0.0.1:9119/api/status | grep auth
Expected: "auth_required": true, "auth_providers": ["basic"]
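
The grep above can be replaced by a stricter check on the parsed JSON. This is a minimal sketch that assumes the status payload uses the two field names shown in the expected output; any other fields are ignored. Pipe the body of /api/status into it, or paste a sample as below.

```python
import json


def basic_auth_enforced(status_body):
    """True if the status payload reports that basic auth is required."""
    status = json.loads(status_body)
    providers = status.get("auth_providers") or []
    return status.get("auth_required") is True and "basic" in providers


# Sample payloads shaped like the expected output above.
protected = '{"auth_required": true, "auth_providers": ["basic"]}'
unprotected = '{"auth_required": false, "auth_providers": []}'

print(basic_auth_enforced(protected))
print(basic_auth_enforced(unprotected))
```

Checking both fields guards against a dashboard that lists a provider but does not actually require it.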

Dashboard — auto-restart on gateway update

After hermes update (or any hermes-gateway restart), the dashboard process goes stale and must be restarted. Wire this up automatically via a systemd override and a sudoers rule — the gateway service runs as your user, which can't call systemctl on system services without sudo.

#### spark-01 — sudoers rule

bash · spark-01
sudo tee /etc/sudoers.d/hermes-dashboard-restart <<'EOF'
YOUR_USERNAME ALL=(root) NOPASSWD: /usr/bin/systemctl --no-block try-restart hermes-dashboard.service
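EOF
sudo chmod 0440 /etc/sudoers.d/hermes-dashboard-restart
sudo visudo -cf /etc/sudoers.d/hermes-dashboard-restart   # syntax check; should report "parsed OK"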

#### spark-01 — gateway service override

bash · spark-01
sudo mkdir -p /etc/systemd/system/hermes-gateway.service.d
sudo tee /etc/systemd/system/hermes-gateway.service.d/override.conf <<'EOF'
[Service]
ExecStartPost=-/usr/bin/sudo -n /usr/bin/systemctl --no-block try-restart hermes-dashboard.service
EOF

sudo systemctl daemon-reload && sudo systemctl restart hermes-gateway
The - prefix on ExecStartPost is required — it prevents a sudo failure from propagating and crashing the gateway. --no-block fires the dashboard restart asynchronously so it doesn't delay the gateway startup.

Hermes desktop app (v0.15.2+)

The Hermes desktop app (released June 2026) connects to your dashboard on spark-01 — it does not run its own backend. Without the remote URL set before first launch, the app spins up a local backend, SIGTERMs it, and enters a reset loop.

Set HERMES_DESKTOP_REMOTE_URL before launching the app for the first time. Add to ~/.hermes/.env on your Mac (not on spark-01):
bash · Mac — ~/.hermes/.env
HERMES_DESKTOP_REMOTE_URL=http://YOUR_NODE1_TAILSCALE_IP:9119
HERMES_DESKTOP_REMOTE_TOKEN=YOUR_DASHBOARD_SESSION_TOKEN

HERMES_DESKTOP_REMOTE_TOKEN must match HERMES_DASHBOARD_SESSION_TOKEN in ~/.hermes/.env on spark-01. The dashboard basic auth credentials (username/password) are separate from this token.

Hermes memory limits

Default memory limits can cause the agent to truncate context or behave inconsistently on long sessions. These values are confirmed stable on this stack:

bash · spark-01
hermes config set memory_char_limit 6000
hermes config set user_char_limit 3000
sudo systemctl restart hermes-gateway
Use hermes config set; do not hand-edit ~/.hermes/config.yaml for these values. The CLI validates and encodes the values, which catches mistakes that a hand edit would let through.

Keeping Hermes up to date

bash
# spark-01: Update Hermes to the latest release
hermes update
# Pulls latest from main, applies dependency changes, restarts the gateway service.
# Confirm it came back up:
sudo systemctl status hermes-gateway --no-pager
STEP 08

Your n8n on spark-01

Your single-instance n8n on spark-01. The client runs their own n8n on spark-02 independently, with different workflows, different credentials, and a separate database. Neither side can read the other's flows.

#### spark-01 — docker-compose for n8n

bash
cat > ~/spark-ai-stack/n8n.yml << 'EOF'
services:
  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=0.0.0.0
      - N8N_PORT=5678
      - N8N_PROTOCOL=http
      - WEBHOOK_URL=http://YOUR_TAILNET_HOSTNAME:5678/
      - N8N_SECURE_COOKIE=false
      - NODE_ENV=production
      - GENERIC_TIMEZONE=America/Los_Angeles
    volumes:
      - n8n_data:/home/node/.n8n
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  n8n_data:
EOF

docker compose -f ~/spark-ai-stack/n8n.yml up -d
Visit http://localhost:5678 on spark-01 (or your tailnet hostname) and create the owner account.

Wire your n8n to your LiteLLM

In n8n, add an OpenAI credential pointing at your LiteLLM (local on spark-01):

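| Field | Value |
| --- | --- |
| API URL | http://host.docker.internal:8001/v1 |
| API Key | your master_key |
| Default model | Qwen3.5-122B-Non-Reasoning (the LiteLLM alias, as in Open WebUI and Hermes) |
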
The client's n8n on spark-02 follows the same pattern but points at their LiteLLM (http://host.docker.internal:8001/v1 from inside their n8n container) with the client's master key. Step 03 covers the client's LiteLLM; Step 5b covers the client's n8n container, which is started with docker run rather than compose.
CHECKLIST

Stack Validation Checklist

Both sides have to pass independently, and each side has to fail in the right places (you should not be able to reach the client's stack, and vice versa). Every item is labeled with the node it should be run from.

Shared compute pool

  • spark-01: docker exec vllm_node ray status shows 2 nodes and 2.0/2.0 GPU, with both DAC IPs (YOUR_NODE1_DAC_IP and YOUR_NODE2_DAC_IP) listed.
  • spark-01: docker logs vllm_node | grep "NCCL INFO NET/IB" shows both rocep1s0f0 and roceP2p1s0f0 — NCCL is on RoCE/RDMA, not TCP sockets.
  • spark-01: nvidia-smi Processes section shows a vllm / RayWorkerWrapper at ~61 GB (122B FP8). The --query-gpu=memory.used / memory.total fields return [N/A] on GB10 — expected, use the Processes section instead.
  • spark-02: nvidia-smi Processes section shows a vllm / RayWorkerWrapper at ~61 GB.
  • spark-01: curl http://localhost:8000/v1/models returns qwen3.5-122b (vLLM head). A streamed completion runs at ~45 tok/s steady state.
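
The ~45 tok/s figure refers to steady-state decoding, so prefill time should be excluded when measuring it. This is a minimal sketch of that calculation, assuming you have recorded an arrival timestamp for each streamed chunk and that each chunk carries one token (an assumption; a server may batch several tokens into one chunk).

```python
def decode_rate(chunk_times, tokens_per_chunk=1):
    """Tokens per second between the first and last streamed chunk.

    Starting the clock at the first chunk excludes prefill
    (time to first token) from the measurement.
    """
    if len(chunk_times) < 2:
        raise ValueError("need at least two chunks to measure a rate")
    elapsed = chunk_times[-1] - chunk_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return (len(chunk_times) - 1) * tokens_per_chunk / elapsed


# Synthetic timestamps: first chunk after 2 s of prefill, then 90 more
# chunks spread evenly over the next 2 s.
healthy = [2.0 + i / 45 for i in range(91)]
print(round(decode_rate(healthy), 1))

# 10 further chunks over 4 s, the pattern of a TCP-fallback cluster.
degraded = [2.0 + i * 0.4 for i in range(11)]
print(round(decode_rate(degraded), 1))
```

Dividing total tokens by total request time instead would understate the rate for long prompts, because prefill would be counted as decoding.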

Your side (spark-01)

  • spark-01: curl http://localhost:8001/v1/models -H "Authorization: Bearer YOUR_MASTER_KEY" returns the model aliases defined in your LiteLLM model_list (for example Qwen3.5-122B-Non-Reasoning), not the raw vLLM name qwen3.5-122b.
  • spark-01: sudo systemctl status litellm hermes-gateway — both active (running).
  • spark-01: docker ps shows vllm_node, open-webui, and n8n all Up. sudo systemctl status hermes-dashboard is active (running).
  • spark-01: Open http://localhost:8080 (or your tailnet hostname), send a chat message — completion streams back. Your LiteLLM log records the request.
  • spark-01: Send a Telegram message — Hermes responds. End-to-end your-side path: Telegram → Hermes → your LiteLLM → vLLM TP=2. Confirm in LiteLLM logs that the request arrived on the Hermes virtual key, not the master key.
  • spark-01: hermes --version returns a version string. If MCP servers are configured: hermes mcp test <server-name> verifies MCP connectivity.
  • spark-01: sudo systemctl status hermes-gateway hermes-dashboard — both active (running).
  • spark-01: ~/.hermes/SOUL.md exists (created by installer). ~/.hermes/skills/ is populated with bundled skills seeded by the installer.

Client side (spark-02)

  • spark-02: curl http://localhost:8001/v1/models -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" returns Qwen3.5-122B-Non-Reasoning and Qwen3.5-122B-Reasoning (the aliases from the client's model_list) through the client's LiteLLM.
  • spark-02: curl http://localhost:8001/v1/chat/completions -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" ... returns a completion.
  • spark-02: sudo systemctl status litellm shows active (running). Logs are written to ~/spark-ai-stack/logs/litellm.log on spark-02, separate from your logs.
  • spark-02: docker ps shows vllm_node (the worker), open-webui, n8n, and litellm-db all Up. (No Hermes container.)
  • spark-02: Open http://localhost:8080 (or the client's tailnet hostname), send a chat message — completion streams back. Client's LiteLLM log records the request; your LiteLLM log on spark-01 does not.

Cross-stack isolation (the negative tests)

  • From client's tailnet: curl http://YOUR_NODE1_TAILSCALE_IP:8001/v1/models should fail (timeout or "no route to host"). Your LiteLLM is not exposed to the client's tailnet.
  • From client's tailnet: curl http://YOUR_NODE1_TAILSCALE_IP:8000/v1/models should fail. The unauthenticated vLLM endpoint is not reachable from the client's tailnet.
  • From client's tailnet: curl http://YOUR_NODE1_TAILSCALE_IP:8080 should fail. Your Open WebUI is not reachable from the client's tailnet.
  • From your tailnet: curl http://YOUR_NODE2_TAILSCALE_IP:8080 should fail. The client's Open WebUI is not reachable from your tailnet.
  • spark-01: tail -f ~/spark-ai-stack/logs/litellm.log while the client sends a chat message → your log is silent. Client's traffic does not enter your stack.

DAC traffic during inference

  • spark-02: nload enp1s0f0np0 spikes into Gb/s during decoding from either side — both your and the client's inference requests traverse the DAC (yours for NCCL collectives, theirs for both the API call to 198.51.100.1:8000 and NCCL).
  • Both sides simultaneously: have you and the client send a long prompt at the same time. Both completions should stream concurrently — vLLM's continuous batching handles the overlap.
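
"Stream concurrently" can be made precise: each request should receive its first chunk before any other request receives its last. This is a small sketch of that test, assuming you have noted the first-chunk and last-chunk times for each request on a shared clock; the function name is illustrative.

```python
def streams_overlap(spans):
    """spans: one (first_chunk_time, last_chunk_time) pair per request.

    Returns True when every stream had started before any stream finished,
    which is what batched decoding looks like from the outside. If the
    server had queued the requests, one stream's first chunk would arrive
    only after another stream's last chunk.
    """
    if len(spans) < 2:
        raise ValueError("need at least two requests to compare")
    latest_first = max(first for first, _ in spans)
    earliest_last = min(last for _, last in spans)
    return latest_first < earliest_last


# Both streams active between t=1.4 and t=8.0: batched together.
print(streams_overlap([(1.0, 9.0), (1.4, 8.0)]))

# Second stream starts only after the first has ended: serialized.
print(streams_overlap([(1.0, 5.0), (5.2, 9.0)]))
```

Comparing request start times would not work here, since both clients send at the same moment whether or not the server batches them.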

Config validation

  • spark-01 (your Hermes config): YAML validation passes — a parse error will silently break Hermes by falling back to .env:
    bash
    python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
    Expected output: YAML valid
REF

Port reference

spark-01 (your node)

| Port | Service | Notes |
| --- | --- | --- |
| 8000 | vLLM head | Shared compute; localhost + DAC peer only · NO TAILSCALE |
| 6379 | Ray GCS | DAC peer only |
| 8001 | Your LiteLLM | Your tailnet only · your master_key |
| 8080 | Your Open WebUI | Your tailnet only |
| 5678 | Your n8n | Your tailnet only |
| 9119 | hermes-dashboard | Hermes built-in dashboard (config, keys, sessions) · Your tailnet only |
| 8265 | Ray dashboard | Optional · localhost only |

spark-02 (client node)

| Port | Service | Notes |
| --- | --- | --- |
| (none) | vLLM worker | No API listener; Ray join over DAC only |
| 8001 | Client LiteLLM | Client's tailnet only · client's master_key |
| 8080 | Client Open WebUI | Client's tailnet only |
| 5678 | Client n8n | Client's tailnet only |

Port 8000 on spark-01 must NEVER be advertised to either tailnet. The client's only path to it is through their own LiteLLM, which calls it over the DAC. If port 8000 leaks onto a tailnet, the client (or anyone else on that tailnet) gets unauthenticated inference access. The host firewall rules in Step 06 enforce this.
REF

File locations

spark-01 — your node

~/spark-vllm-docker/ # cloned eugr/spark-vllm-docker — vLLM launcher
├── .env # LOCAL_IP, ETH_IF, IB_IF, CONTAINER_HF_TOKEN — written by --discover
├── build-and-copy.sh # builds vllm-node-tf5:latest + copies to spark-02
├── run-recipe.sh # launches Ray + vllm serve from the chosen recipe
└── launch-cluster.sh # stop/teardown helper invoked by systemd

~/spark-ai-stack/
├── litellm-config.yaml # YOUR LiteLLM — your master_key, localhost:8000
├── n8n.yml # compose for your n8n
└── logs/
    ├── litellm.db # YOUR SQLite request log
    └── litellm.log # YOUR text log

~/.hermes/ # YOUR Hermes config, memory, skills
~/workspace/ # YOUR Hermes file workspace

Docker volumes:
├── open-webui # YOUR Open WebUI database, knowledge bases, RAG
└── n8n_data # YOUR n8n flows + credentials

/etc/systemd/system/
├── vllm-cluster.service # vLLM cluster auto-start (head SSHes to worker)
├── litellm.service # YOUR LiteLLM auto-start
├── litellm.service.d/override.conf # PYTHONPATH only
├── hermes-gateway.service # YOUR Telegram gateway auto-start
└── hermes-dashboard.service # YOUR Hermes dashboard auto-start

spark-02 — client node (separate ownership)

~/.cache/huggingface/hub/ # model weights rsync'd from spark-01 over DAC
    └── models--Qwen--Qwen3.5-122B-A10B-FP8/

~/spark-ai-stack/
├── litellm-config.yaml # CLIENT LiteLLM — client's master_key, points DAC → vLLM
├── n8n.yml # compose for client's n8n
└── logs/
    ├── litellm.db # CLIENT SQLite log — separate corpus
    └── litellm.log # CLIENT text log

Docker image (pushed from spark-01):
└── vllm-node-tf5:latest # same image as spark-01; worker container started by spark-01's launcher over SSH

Docker volumes:
├── open-webui # CLIENT Open WebUI database, knowledge bases, RAG
└── n8n_data # CLIENT n8n flows + credentials

/etc/systemd/system/
└── litellm.service # CLIENT LiteLLM auto-start

Backup targets

Node / Owner | Path | Contents
spark-01 (you) | ~/spark-ai-stack/logs/ | Your LiteLLM corpus
spark-01 (you) | ~/.hermes/ | Your Hermes memory, sessions, skills
spark-01 (you) | Docker volumes open-webui, n8n_data | Your UI state: chat history, knowledge bases, flows
spark-02 (client) | ~/spark-ai-stack/logs/ | Client's LiteLLM corpus (their backup, not yours)
spark-02 (client) | Docker volumes open-webui, n8n_data | Client's UI state (their backup, not yours)
Backups are owner-specific. Don't pull the client's volumes into your backup pipeline — that defeats the application-layer isolation. Each owner backs up their own node's state.
REF

Cluster issues

Two-node-specific failures and their resolutions. Most of these were discovered during live deployment.

vLLM inference is extremely slow (2–3 tok/s) on multi-node setup
Symptom: The cluster comes up cleanly, both GPUs register, and completions stream, but throughput sits at 2–3 tok/s instead of the expected ~45 tok/s. nload on the DAC shows traffic, but only a fraction of link capacity.
Cause: NCCL is falling back to TCP/IP sockets instead of using RDMA over RoCE. Either /dev/infiniband wasn't passed into the container, NCCL_IB_HCA wasn't set, or both. The hand-built vllm-spark:26.04 image had this problem by design: RDMA passthrough was never plumbed in.
Fix: The eugr/spark-vllm-docker stack (Step 1b–1e) sets NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 and passes /dev/infiniband into both containers automatically. If you're on the old hand-built image, migrate; there is no equivalent flag fix. Confirm the transport:
bash · spark-01
docker logs vllm_node | grep -E "NET/IB|NET/Socket"
# NET/IB  = RDMA (good)
# NET/Socket = TCP fallback (bad)
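To make the transport check scriptable, the grep above can be wrapped in a small classifier. This is a minimal sketch, not part of the eugr launcher: classify_nccl_transport is a hypothetical helper name. It assumes that any NET/Socket line means TCP fallback, so NET/Socket takes precedence over NET/IB (NCCL can print a NET/IB line such as "No device found" on its way to falling back).

```shell
# classify_nccl_transport: read NCCL log text on stdin and print one of
#   socket   a NET/Socket line is present (assumed to mean TCP fallback)
#   rdma     a NET/IB line is present and no NET/Socket line was seen
#   unknown  neither marker appears (e.g. NCCL INFO logging is not enabled)
classify_nccl_transport() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -q 'NET/Socket'; then
    echo socket
  elif printf '%s\n' "$log" | grep -q 'NET/IB'; then
    echo rdma
  else
    echo unknown
  fi
}

# Live usage on spark-01 (container name taken from this guide):
#   docker logs vllm_node 2>&1 | classify_nccl_transport
```

Anything other than rdma is a reason to recheck the /dev/infiniband passthrough and NCCL_IB_HCA before looking elsewhere.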
NVFP4 model stalls silently after worker spawn (37 GB allocated, no further progress)
Symptom: An NVFP4 checkpoint (e.g. Sehyo/Qwen3.5-122B-A10B-NVFP4) loads weights to ~37 GB on both nodes, the worker registers with Ray, and then everything stops: no errors, no progress, and engine init never completes.
Cause: NVFP4 quantization is currently single-node only on DGX Spark. Multi-node NVFP4 fails at cluster launch in both eugr/spark-vllm-docker (the recipe is marked solo_only: true) and vllm/vllm-openai:cu130-nightly. The Sehyo/Qwen3.5-122B-A10B-NVFP4 checkpoint additionally has weight names that do not match the fused Mamba layer names in newer vLLM.
Fix: Use Qwen/Qwen3.5-122B-A10B-FP8 for multi-node inference (Step 01 production model). Do not attempt NVFP4 with TP=2 multi-node until community support is confirmed.
Ray version mismatch between head and worker
Symptom: The worker fails to join the head with an error referencing incompatible Ray client/server protocol versions, or the head accepts the worker but engine init aborts with a serialization error.
Cause: The two containers were built from different base images (e.g. you rebuilt on spark-01 but the older image is still on spark-02), or one node is running an out-of-date image tag.
Fix: Always start the head container first, wait for "Ray runtime started" in docker logs vllm_node, then start the worker. Both nodes must run identical images, so re-run the copy step whenever the image changes on either side:
bash · spark-01
cd ~/spark-vllm-docker
./build-and-copy.sh --tf5 --copy-to YOUR_NODE2_DAC_IP
Placement group allocation failed / node:192.0.2.x not found
Symptom: vLLM logs show a placement-group allocation that never succeeds, or a Ray placement spec referencing a node IP Ray doesn't recognize. The cluster has 2 GPUs registered, but vLLM only sees one.
Cause: VLLM_HOST_IP is unset, so vLLM resolves the node's hostname to the mgmt IP (192.0.2.x), but Ray registered each node on its DAC IP (198.51.100.x). The placement group spec then targets a node Ray has never seen.
Fix: Set VLLM_HOST_IP to each container's own DAC IP: on spark-01's head container, VLLM_HOST_IP=YOUR_NODE1_DAC_IP; on spark-02's worker container, VLLM_HOST_IP=YOUR_NODE2_DAC_IP. The eugr launcher's ./run-recipe.sh --discover (Step 1c) sets LOCAL_IP in .env per node, and the cluster launcher exports it as VLLM_HOST_IP into each container.
"Tensor parallel size exceeds available GPUs (1)" warning
Symptom: vLLM head logs print a warning that TP=2 exceeds the locally visible GPU count (1) before Ray finishes registering the second node.
Cause: Expected and harmless on a 2-node, 1-GPU-per-node setup. vLLM spreads TP workers across nodes via Ray. The warning fires before Ray confirms the second node's GPU is part of the cluster; it is not an error as long as the worker is up and joining.
Fix: Ignore. The eugr run-recipe.sh launcher gates vllm serve behind this check, so under normal operation you won't see it. Confirm once the worker joins:
bash · spark-01
docker exec vllm_node ray status
# Expected: 2.0/2.0 GPU
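If you want a startup script to wait on this condition rather than eyeballing it, the GPU total can be pulled out of the ray status text. A minimal sketch, assuming the resource line has the used/total form shown in the comment above (for example 2.0/2.0 GPU); ray_gpu_total and cluster_has_two_gpus are hypothetical helper names.

```shell
# ray_gpu_total: read `ray status` text on stdin and print the GPU total,
# i.e. the number after the slash in a line such as " 2.0/2.0 GPU".
ray_gpu_total() {
  sed -n 's|.*[0-9.][0-9.]*/\([0-9.][0-9.]*\) GPU.*|\1|p' | head -n 1
}

# cluster_has_two_gpus: succeed only if the GPU total on stdin is at least 2.
cluster_has_two_gpus() {
  total=$(ray_gpu_total)
  [ -n "$total" ] && awk -v t="$total" 'BEGIN { exit !(t + 0 >= 2) }'
}

# Live usage on spark-01:
#   until docker exec vllm_node ray status | cluster_has_two_gpus; do sleep 5; done
```

The check looks at the total rather than the used count, so it passes as soon as the worker's GPU has registered, whether or not vLLM has claimed it yet.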
Worker can't reach head — Ray GCS connection refused
Symptom: Worker logs (docker logs vllm_node on spark-02) print "ConnectionError: ... 198.51.100.1:6379" continuously and never advance.
Cause: Either (a) the head container hasn't started, (b) LOCAL_IP in the head's .env isn't the DAC IP, so Ray bound to the wrong interface, or (c) a firewall blocks port 6379 over the DAC.
Fix: Confirm LOCAL_IP=YOUR_NODE1_DAC_IP in ~/spark-vllm-docker/.env on the head. Check Ray, restart if needed, then probe port 6379 from the worker:
bash · spark-01
docker exec vllm_node ray status
# if Ray isn't running:
cd ~/spark-vllm-docker && ./run-recipe.sh qwen3.5-122b-fp8 -d -- --served-model-name qwen3.5-122b --gpu-memory-utilization 0.80
bash · spark-02
nc -zv YOUR_NODE1_DAC_IP 6379
# should print: Connection to YOUR_NODE1_DAC_IP 6379 port [tcp/*] succeeded
NCCL traffic on the wrong interface (mgmt LAN instead of DAC)
Symptom: Tensor-parallel inference works but is slow. nload enp1s0f0np0 stays idle; the mgmt LAN sees Gb/s spikes instead.
Cause: Either (a) NCCL_IB_HCA wasn't set, so NCCL couldn't find the RoCE devices and fell back to TCP auto-detect, picking the first interface it saw (often mgmt), or (b) only one of the two RoCE twins is listed in NCCL_IB_HCA and NCCL silently went to TCP rather than use one twin.
Fix: Confirm IB_IF=rocep1s0f0,roceP2p1s0f0 (both twins, comma-separated) and ETH_IF=enp1s0f0np0 in ~/spark-vllm-docker/.env on both nodes. Verify NCCL picked up the RoCE devices:
bash · both nodes
docker logs vllm_node | grep "NCCL INFO NET/IB"
# Expected: both rocep1s0f0 and roceP2p1s0f0 listed
nvidia-smi memory.used returns [N/A] on GB10
Symptom: nvidia-smi --query-gpu=memory.used,memory.total --format=csv returns [N/A] in both fields. Memory monitoring scripts that depend on these fields silently report nothing.
Cause: GB10 (Grace Blackwell) uses unified memory. The classic memory.used / memory.total NVML fields are not populated on this hardware.
Fix: Use plain nvidia-smi and read the Processes section. With the 122B FP8 model loaded you should see a vllm / RayWorkerWrapper process at ~61 GB per node; 35B FP8 lands closer to ~97 GB.
bash · both nodes
nvidia-smi
# Read the Processes section — memory.used / memory.total fields return [N/A] on GB10
SSH by hostname hits the DAC instead of mgmt
Symptom: ssh spark-02 from spark-01 hangs or returns "connection refused", even though both nodes are reachable.
Cause: By default, each node's hostname resolves to its DAC IP (198.51.100.x). Unless you've explicitly bound sshd to the DAC interface, SSH only listens on the mgmt LAN, while your SSH client has resolved the hostname to the DAC IP.
Fix: Add mgmt-IP entries to /etc/hosts on both nodes (covered in Prerequisites). The HF cache rsync uses the DAC IP explicitly, but every other ssh spark-0X command relies on hostname resolution.
bash · both nodes
echo "YOUR_NODE1_MGMT_IP  spark-01" | sudo tee -a /etc/hosts
echo "YOUR_NODE2_MGMT_IP  spark-02" | sudo tee -a /etc/hosts
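Because tee -a appends unconditionally, re-running the two lines above duplicates the entries. A minimal idempotent sketch: ensure_hosts_entry is a hypothetical helper that appends an IP/name pair only when that exact pair is not already present. It writes to the file directly, so run it from a root shell when the target is /etc/hosts. The demo below uses a temporary file and the example mgmt address from this guide.

```shell
# ensure_hosts_entry FILE IP NAME
# Append "IP  NAME" to FILE unless a line already maps NAME to IP.
ensure_hosts_entry() {
  file=$1; ip=$2; name=$3
  if awk -v ip="$ip" -v name="$name" '
       $1 == ip {
         for (i = 2; i <= NF; i++) {
           if ($i ~ /^#/) break          # stop at a trailing comment
           if ($i == name) found = 1
         }
       }
       END { exit !found }' "$file" 2>/dev/null; then
    return 0                             # pair already present: nothing to do
  fi
  printf '%s  %s\n' "$ip" "$name" >> "$file"
}

# Demo on a scratch file; the second call is a no-op.
tmp=$(mktemp)
ensure_hosts_entry "$tmp" 192.0.2.21 spark-01
ensure_hosts_entry "$tmp" 192.0.2.21 spark-01
cat "$tmp"
rm -f "$tmp"
```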
docker: permission denied on spark-02
Symptom: Any docker command on spark-02 fails with "permission denied while trying to connect to the Docker daemon socket". Common on a freshly imaged worker node where Docker installed cleanly but the operator account isn't in the docker group yet.
Fix: Add the operator account to the docker group. The new group membership only takes effect after logging out and back in, or after running newgrp docker in the current shell.
bash · spark-02
sudo usermod -aG docker YOUR_USERNAME
newgrp docker
LiteLLM httpx.ConnectError on startup
Symptom: The LiteLLM systemd service won't start. Logs show httpx.ConnectError against either Postgres or a model endpoint.
Cause: Two common causes: (1) a stale STORE_MODEL_IN_DB=True + DATABASE_URL in /etc/systemd/system/litellm.service.d/override.conf still pointing at a Postgres that no longer runs, or (2) litellm-config.yaml includes a model with api_base pointing at a dead port (e.g. localhost:8002 from a previous dual-model setup).
Fix: Reset override.conf to contain only PYTHONPATH (see Step 02). Audit litellm-config.yaml for any localhost:8002 or other dead endpoints and remove them; every model in model_list must have a live api_base.
vLLM container fails to start after power loss — HF DNS resolution error
Symptom: After a power outage, the vllm_node container on spark-01 exits immediately. docker logs vllm_node shows repeated "[Errno -3] Temporary failure in name resolution" against huggingface.co, followed by huggingface_hub.errors.LocalEntryNotFoundError and an OSError about being unable to connect to HF to load files.
Cause: On container start, vLLM attempts to contact huggingface.co to check for model config updates, even when the weights are already in the local cache. If Docker's internal DNS resolver hasn't recovered by the time the container starts (common immediately after a power cycle), the resolution fails and vLLM exits rather than falling back to the cached files.
Fix: Set HF_HUB_OFFLINE: 1 and TRANSFORMERS_OFFLINE: 1 in the recipe env block (see Step 1e). These vars were added to qwen3.5-122b-fp8.yaml as the permanent fix. If the container is already failing and the recipe hasn't been patched yet, apply the patch and restart the service:
bash · spark-01
sudo systemctl restart vllm-cluster.service
sudo journalctl -u vllm-cluster.service -f
Before restarting the service on spark-01, confirm that no stale vllm_node container is left on spark-02. The launcher SSHes into spark-02 to start the worker itself, so a leftover worker container must be stopped and removed first (docker stop vllm_node && docker rm vllm_node on spark-02) or the cluster won't form.
Client LiteLLM can't reach vLLM
Symptom: Client LiteLLM on spark-02 returns connection refused or connect timeout on every request. ~/spark-ai-stack/logs/litellm.log on spark-02 shows httpx.ConnectError against 198.51.100.1:8000.
Cause: Either (a) the DAC link is down, (b) the vLLM head container on spark-01 is not running, or (c) the host firewall on spark-01 is blocking the DAC peer from reaching port 8000.
Fix: The Step 06 ufw rule should explicitly allow YOUR_NODE2_DAC_IP on port 8000 over enp1s0f0np0. Work through the chain:
bash · spark-02
ping -c 3 YOUR_NODE1_DAC_IP          # DAC link reachable?
nc -zv YOUR_NODE1_DAC_IP 8000        # port open?
bash · spark-01
docker ps | grep vllm_node           # container running?
sudo ufw status | grep 8000          # firewall rule present?
Client sees spark-01:8000 directly without auth
Symptom: From the client's tailnet, curl http://YOUR_NODE1_TAILSCALE_IP:8000/v1/models succeeds, meaning the client could bypass their LiteLLM and hit the unauthenticated vLLM directly.
Cause: vLLM has no auth. Either (a) Tailscale ACLs on spark-01 inadvertently expose port 8000, (b) the host firewall on spark-01 is allowing inbound on tailscale0 for port 8000, or (c) ufw is disabled.
Fix: Audit the ACL JSON on spark-01's tailnet: there must be no tag:owner:8000 entry anywhere. In ufw, only the explicit DAC-peer rule should appear for port 8000. If you find a leak, fix the ACL and ufw, then re-run the negative tests in the validation checklist.
bash · spark-01
sudo ufw status verbose | grep 8000
# Only this rule should appear for 8000:
# ALLOW IN  enp1s0f0np0  from YOUR_NODE2_DAC_IP  to any port 8000
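The same audit can be scripted so it fails loudly instead of relying on a visual scan. A minimal sketch under stated assumptions: unexpected_rules is a hypothetical helper that reads ufw status verbose text on stdin and prints every line that mentions the port but not the expected interface (and, when given, the expected peer IP). It matches on substrings only, so it does not depend on the exact column layout of ufw's output.

```shell
# unexpected_rules PORT IFACE [PEER_IP]
# Print each stdin line that mentions PORT as a whole number but lacks
# IFACE (and PEER_IP, when one is given). No output means a clean audit.
unexpected_rules() {
  port=$1; iface=$2; peer=${3:-}
  grep -E "(^|[^0-9])${port}([^0-9]|\$)" | while IFS= read -r line; do
    case $line in
      *"$iface"*"$peer"* | *"$peer"*"$iface"*) ;;   # expected rule: skip
      *) printf '%s\n' "$line" ;;                   # anything else: report
    esac
  done
}

# Live usage on spark-01 (interface name and placeholder from this guide):
#   sudo ufw status verbose | unexpected_rules 8000 enp1s0f0np0 YOUR_NODE2_DAC_IP
# The 8001 check on either node works the same way:
#   sudo ufw status verbose | unexpected_rules 8001 tailscale0
```

An empty result is the pass condition; any printed line is a rule to investigate before re-running the negative tests.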
Both LiteLLMs collide on port 8001 over the mgmt LAN
Symptom: From spark-01, curl http://YOUR_NODE2_MGMT_IP:8001/v1/models reaches the client's LiteLLM when it should be refused, leaking the client's API surface to your mgmt LAN.
Cause: By design, both LiteLLMs bind to 0.0.0.0:8001 so each is reachable from its own tailnet. The host firewall must deny inbound port 8001 on the mgmt-LAN interface; Tailscale ACLs alone don't help here because the mgmt LAN is not part of any tailnet.
Fix: The Step 06 ufw rules restrict 8001 to tailscale0 only. Confirm no rule allows 8001 on the wildcard or mgmt interface:
bash · both nodes
sudo ufw status verbose | grep 8001
# Should only show: ALLOW IN  tailscale0  to any port 8001
Client can see your LiteLLM logs (or vice versa)
Symptom: You find an entry in ~/spark-ai-stack/logs/litellm.log on spark-01 that you didn't make, or the client reports a request in their log that they didn't send.
Cause: Almost always, someone reused a master key across the two LiteLLMs, or the Open WebUI on one side is misconfigured to point at the other side's LiteLLM.
Fix: Confirm master_key values are unique on each node; they should never match. Then check each Open WebUI's API base URL: yours should be http://host.docker.internal:8001/v1 (your local LiteLLM); the client's should use the same hostname (their local LiteLLM, not your tailnet IP). If you find cross-pointing, fix it and rotate both master keys.
235B FP8 model OOM on the 256 GB cluster
Symptom: Loading Qwen/Qwen3-235B-A22B-FP8 at TP=2 fails. Ray kills the worker with an OutOfMemoryError at the ~95% memory threshold during weight loading; the head logs report a placement-group failure shortly after.
Cause: 235B at FP8 is ~235 GB of weights total. Split across TP=2 that's ~117.5 GB per node, and only ~121 GB is usable per node. The remaining ~3.5 GB is not enough for KV cache and OS overhead.
Fix: Use a smaller model. The 122B FP8 production model uses ~61 GB per node and fits comfortably with headroom for KV cache. The 235B FP8 does not fit at TP=2 on this hardware.
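The arithmetic behind this entry, as a quick check you can rerun for other models. It assumes roughly 1 GB of weights per billion parameters at FP8 and the ~121 GB usable per node quoted above; tp_headroom is a hypothetical helper, and the figures cover weights only (no KV cache, activations, or OS overhead).

```shell
# tp_headroom WEIGHTS_GB TP USABLE_GB_PER_NODE
# Print the per-node weight footprint and what is left of the usable memory.
tp_headroom() {
  awk -v w="$1" -v tp="$2" -v usable="$3" 'BEGIN {
    per_node = w / tp
    printf "%.1f GB weights per node, %.1f GB headroom\n", per_node, usable - per_node
  }'
}

tp_headroom 235 2 121   # 235B FP8: 117.5 GB weights per node, 3.5 GB headroom
tp_headroom 122 2 121   # 122B FP8: 61.0 GB weights per node, 60.0 GB headroom
```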
huggingface-cli is deprecated
Symptom: Running huggingface-cli download … prints a deprecation notice, or the command is missing on a fresh install of huggingface_hub.
Fix: Use the new CLI: hf download <model> --local-dir <path>. Step 01 has been updated to use hf; if you have an older snippet around, swap the binary name. The flags map cleanly.
Model "thinks" out loud / verbose deliberation in Open WebUI
Symptom: Conversational queries cause the model to produce long deliberative preambles ("Let me think about this..." or visible chain-of-thought). Latency climbs and token usage balloons. Sometimes the model loops in its own reasoning trace.
Cause: The chat template was rendered with preserve_thinking: true (or any other flag that exposes the visible CoT track). The Qwen3.5 family produces extended reasoning whenever thinking is enabled at template time, and conversational queries trigger excessive deliberation under that setting.
Fix: Pass --default-chat-template-kwargs '{"enable_thinking": false}' as a vLLM flag after the -- separator on the run-recipe.sh launch command (the Step 1e launch line does this). Users can opt into extended reasoning per message by prefixing a prompt with /think. This matches the behavior that the recommended Open WebUI system prompt (Step 05) is calibrated for.
LiteLLM UI shows: table public.LiteLLM_UserTable does not exist
Symptom: The LiteLLM Admin UI loads but every page (Users, Virtual Keys, Models) returns an error referencing a missing LiteLLM_* table in the public schema.
Cause: The Postgres database is up and the connection succeeded, but the Prisma schema has not been pushed yet, so the database is empty.
Fix: Push the Prisma schema, then restart LiteLLM. See Step 2d / 3d.
bash
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
  prisma db push \
  --schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma

sudo systemctl restart litellm
LiteLLM UI shows: Not connected to DB
Symptom: The Admin UI banner reports "Not connected to DB" even though Postgres is running and reachable on localhost:5432.
Cause: Either (a) database_url is at the top level of litellm-config.yaml instead of nested under general_settings (LiteLLM only reads it from general_settings), or (b) the config still points at SQLite. SQLite is not supported for the UI; the Prisma schema is hardcoded for PostgreSQL.
Fix: Move database_url under general_settings in litellm-config.yaml, then restart and confirm:
yaml
general_settings:
  master_key: YOUR_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
bash
sudo systemctl restart litellm
journalctl -u litellm -n 50 --no-pager
# Look for: Successfully connected to postgres DB
Virtual key not found in LiteLLM_VerificationTokenTable
Symptom: A virtual key generated from the UI is rejected on use with a "VerificationToken not found" error, but the key is still visible in the Virtual Keys tab.
Cause: The key was generated against the database before the schema was fully initialized (e.g. you generated a key, then re-ran prisma db push, which dropped and recreated the verification token table).
Fix: Delete the orphaned key in the UI, restart LiteLLM, then generate a fresh key. The new key will be inserted into the current schema and will validate correctly.
bash
sudo systemctl restart litellm
REF

Other known issues

MTP speculative decoding breaks all clients in vLLM v0.19.0
Symptom: Every request from every client (Open WebUI, Hermes, n8n, direct API) fails with HTTP 500. LiteLLM logs show:
text
litellm.APIConnectionError: OpenAIException - The min_p and logit_bias
sampling parameters are not yet supported with speculative decoding.
Cause: --speculative-config '{"method":"qwen3_next_mtp",...}' is active in the recipe. Open WebUI and Hermes send min_p and logit_bias sampling parameters by default. vLLM rejects any request containing these params when speculative decoding is enabled. There is no per-request fallback, so the entire API surface breaks simultaneously.

This is a known vLLM bug affecting Qwen3.5-class models: vllm-project/vllm#35800
Fix: Remove --speculative-config from the recipe, then restart the cluster:
bash · spark-01
sed -i '/speculative-config/d' ~/spark-vllm-docker/recipes/qwen3.5-122b-fp8.yaml
grep "speculative" ~/spark-vllm-docker/recipes/qwen3.5-122b-fp8.yaml || echo "Removed OK"
Stop the worker first, then restart the service:
bash · spark-02
docker stop vllm_node && docker rm vllm_node
bash · spark-01
sudo systemctl restart vllm-cluster.service
Weight reload takes ~4 minutes. Monitor with docker logs -f vllm_node on spark-01.
NVFP4 quantization is single-node only on DGX Spark
Limitation: NVFP4 multi-node cluster mode is not supported as of vLLM cu130-nightly / eugr spark-vllm-docker. The NVFP4 recipe is marked solo_only: true in eugr/spark-vllm-docker, and vllm/vllm-openai:cu130-nightly fails at cluster launch when NVFP4 is selected with TP=2 multi-node.
Workaround: Do not attempt NVFP4 with TP=2 multi-node until community support is confirmed. Use Qwen/Qwen3.5-122B-A10B-FP8 for multi-node inference; the Step 01 production recipe is built around it. See the NVFP4 stall entry in Cluster issues for the symptom on the bad path.
LiteLLM port already in use
Symptom: systemd shows "[Errno 98] address already in use".
Fix: Kill the stale process and restart the service:
bash
pkill -f litellm
sudo systemctl restart litellm
hermes: command not found
Symptom: Hermes is installed but the shell can't find it.
Fix: Reload the shell rc, or invoke with the full path:
bash
source ~/.bashrc
# or invoke directly:
/home/YOUR_USERNAME/.local/bin/hermes --version
Hermes gateway systemd install fails
Symptom: sudo: hermes: command not found
Fix: Use the absolute path when running the installer as root:
bash
sudo /home/YOUR_USERNAME/.local/bin/hermes gateway install --system
sudo systemctl start hermes-gateway
sudo systemctl status hermes-gateway --no-pager
MCP stdio servers fail health check — npx not in LiteLLM's PATH
Symptom: Adding a stdio-based MCP server (e.g. Brave Search) via the LiteLLM UI succeeds, but the health check after saving fails. No error is shown; the check just doesn't pass.
Cause: Node.js was installed via nvm or a user-local installer, which places binaries in ~/.local/bin and adds that path to the user's shell rc file (.bashrc). LiteLLM, running as a systemd service, never sources .bashrc, so it gets a clean environment with no ~/.local/bin in PATH and cannot find npx when it tries to spawn the stdio MCP server process.
Fix: Either symlink the user-local binaries into a system-wide path, or install Node system-wide via NodeSource. After either fix, retry the health check in the LiteLLM UI.
bash · option A — symlink existing install
sudo ln -sf /home/YOUR_USERNAME/.local/bin/node /usr/local/bin/node
sudo ln -sf /home/YOUR_USERNAME/.local/bin/npx  /usr/local/bin/npx
bash · option B — install Node system-wide via NodeSource
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
Signal on arm64 — not supported, use Telegram
Symptom: signal-cli 0.14.x requires Java 25 (not in Ubuntu 24.04 apt repos). Signal servers block datacenter/home-lab IPs at the TLS level regardless of Java version.
Fix: Use Telegram. Telegram has no IP restrictions and native arm64 support. Signal has been evaluated and deprioritized; Telegram is the only supported remote messaging gateway on this stack.
Power outage recovery — orphaned container grabs port 8000
Symptom: After a power outage, vllm-cluster.service fails to start on spark-01. docker logs vllm_node shows "address already in use :8000", or the Ray cluster comes up but the API server never binds.
Cause: An old standalone container with restart: always on port 8000 (e.g. a hand-run vllm-qwen container predating vllm-cluster.service) grabbed the port on boot before vllm-cluster.service fired. Any container with --restart=always on port 8000 will cause this.
Fix: Identify and remove the offending container:
bash · spark-01
# Find what's holding port 8000
docker ps --filter "publish=8000" --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}"

# If a non-vllm_node container appears, stop and remove it:
docker stop <container-name> && docker rm <container-name>

# Then start the cluster:
sudo systemctl restart vllm-cluster.service
docker logs -f vllm_node

Correct restart policies for this stack:
vllm_node: restart: no (systemd-managed; must never self-restart)
n8n, open-webui, litellm-db: restart: unless-stopped

Post-outage startup order: (1) on spark-02, confirm vllm_node is absent (docker ps | grep vllm_node prints nothing); (2) on spark-01, run sudo systemctl restart vllm-cluster.service; (3) monitor with docker logs -f vllm_node.
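Step (1) can be made a hard gate in front of step (2). A minimal sketch: worker_absent is a hypothetical helper that reads container names on stdin, one per line, and succeeds only if the given name is not among them. The usage line passes -a to docker ps so that stopped containers count as present too, which is an assumption on top of the check described above.

```shell
# worker_absent NAME
# Succeed only if NAME does not appear as a whole line on stdin.
worker_absent() {
  ! grep -qxF -- "$1"
}

# Live usage from spark-01 (hostnames and unit name taken from this guide):
#   ssh spark-02 "docker ps -a --format '{{.Names}}'" | worker_absent vllm_node \
#     && sudo systemctl restart vllm-cluster.service
```

The match is on the exact container name, so an unrelated container whose name merely contains vllm_node does not block the restart.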
config.yaml YAML error causes silent .env fallback
Symptom: After manually editing ~/.hermes/config.yaml, Hermes seems to ignore the changes: MCP is broken and the provider reverts to default. No obvious error in the logs.
Cause: A YAML syntax error causes Hermes to silently fall back to .env values. No error is shown at startup, so the misconfiguration is invisible.
Fix: Always validate YAML after editing. If the command exits without printing "YAML valid", fix the syntax before restarting.
bash
python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
Hermes crashes on startup — MCP server connection failing
Symptom: Hermes exits during startup with a stack trace mentioning mcp_servers or a failed MCP client handshake. Logs reference an asyncio error or a TCP/stdio connection that never opened.
Cause: This stack proxies all MCP traffic through LiteLLM (see Step 02). Hermes does not need a direct mcp_servers block in ~/.hermes/config.yaml; it talks to LiteLLM's MCP endpoint over the LiteLLM API. An mcp_servers block in the Hermes config is only correct in a direct-MCP (non-LiteLLM-proxied) topology, and on this stack it points Hermes at servers it can't reach.
Fix: Remove the mcp_servers block from ~/.hermes/config.yaml entirely. MCP tool calls continue to work because LiteLLM is in the path. Validate and restart:
bash
python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
sudo systemctl restart hermes-gateway
custom_providers indentation error at startup
Symptom: Hermes fails to start with a YAML parse error after adding a custom provider to ~/.hermes/config.yaml.
Cause: List items under custom_providers must be indented exactly 2 spaces. A common mistake is using 4 spaces or no indentation.
Fix: Use this exact indentation:
yaml
custom_providers:
  - name: MyProvider
    base_url: http://localhost:8001/v1
    model: my-model
Hermes: hermes command not found after install
Symptom: The installer completed successfully but the shell reports hermes: command not found.
Cause: The installer appends PATH="$HOME/.local/bin:$PATH" to ~/.bashrc. If you ran the installer via curl ... | bash in a non-login shell, the new PATH line is not active until you source it or open a new shell.
Fix: Source the rc file, or open a new shell:
bash
source ~/.bashrc    # or open a new shell
Hermes: hermes setup fails with import error — wrong Python picked up
Symptom: hermes setup exits immediately with a Python import error or ModuleNotFoundError.
Cause: The installer creates a venv at ~/.hermes/hermes-agent/venv/ using Python 3.11 via uv. If the symlink at ~/.local/bin/hermes is broken or points elsewhere, the wrong Python is used.
Fix: Confirm the symlink points into the venv:
bash
ls -la ~/.local/bin/hermes
# Should point to: ~/.hermes/hermes-agent/venv/bin/hermes
Hermes: gateway fails to start after system reboot
Symptom: sudo systemctl status hermes-gateway shows a failed or activating state after reboot.
Cause: hermes-gateway.service may start before the network is fully up.
Fix: Add network-online ordering to the unit file:
bash
# Add to the [Unit] section of /etc/systemd/system/hermes-gateway.service:
#   After=network-online.target
#   Wants=network-online.target
sudo systemctl daemon-reload && sudo systemctl restart hermes-gateway
Hermes: mini-swe-agent terminal tools not working
Symptom: Shell/terminal tool calls from Hermes fail, or the mini-swe-agent submodule is reported as missing.
Fix: Re-initialize the submodule and reinstall it into the venv:
bash
cd ~/.hermes/hermes-agent
git submodule update --init --recursive
~/.hermes/hermes-agent/venv/bin/pip install -e ./mini-swe-agent
APPENDIX

Clustered Open WebUI / n8n (HA notes)

Both Open WebUI and n8n have HA modes available, but for a two-node home/lab setup the operational complexity is not worth it. This stack runs a single instance of each per owner: yours on spark-01, the client's on spark-02. If you ever want to pursue HA, here are the pointers.

Open WebUI HA

  • Switch the open-webui container's storage from a Docker volume to a Postgres backend (env: DATABASE_URL=postgresql://...) and a shared filesystem for uploads and RAG documents.
  • Run multiple replicas behind a TCP load balancer. Sticky sessions are recommended for SSE chat streams.
  • Postgres can sit on either node; if you put it on spark-01 you'll re-introduce the very latency-disturbance pattern this architecture is designed to avoid. Prefer a third small box or a dedicated HA pair.

n8n HA

  • n8n's queue mode requires Postgres for state and Redis for the BullMQ queue. The main container becomes the main instance; one or more worker instances pull jobs off the queue.
  • Set EXECUTIONS_MODE=queue, QUEUE_BULL_REDIS_HOST=…, DB_TYPE=postgresdb, and the relevant Postgres env vars on every container. Replicas need the same N8N_ENCRYPTION_KEY.
  • For a two-node setup the simplest variant is one main on the owning GPU node and one worker on a third small box, with Postgres + Redis colocated on the third box.
  • Webhook traffic should hit only the main container; long-running executions land on workers transparently.
If you genuinely need HA at this layer, the operational answer is usually "add a third box for Postgres + Redis," not "split across the two GPU nodes." Putting stateful services on the inference nodes will compromise either inference latency or HA availability — usually both.
OBSIDIAN VAULT SYNC + MCP

Architecture

Obsidian vault sync runs on a dedicated Ubuntu 24.04 LXC (hostname: YOUR_LXC_HOSTNAME, IP: YOUR_LXC_IP) on the Proxmox homelab. Syncthing syncs the vault from Mac/Android to /vault on the LXC. @modelcontextprotocol/server-filesystem exposes the vault as a filesystem MCP server, wrapped by supergateway with streamableHttp transport on port 3000. LiteLLM on spark-01 connects to it at http://YOUR_LXC_IP:3000/mcp.

Why not obsidian-mcp? obsidian-mcp (StevenStavrakis) requires the Obsidian desktop app to be running in the LXC — not viable headless. @modelcontextprotocol/server-filesystem exposes the vault directory directly with no app dependency.
Obsidian (Mac/Android) <-> Syncthing <-> /vault on LXC <-> server-filesystem <-> supergateway :3000 <-> LiteLLM MCP client

LXC setup

Setting | Value
OS | Ubuntu 24.04 LTS
Hostname | YOUR_LXC_HOSTNAME
IP | YOUR_LXC_IP
Vault path | /vault
Services | syncthing@root and obsidian-mcp, both enabled as systemd services
OBSIDIAN VAULT SYNC + MCP

LXC setup script

Run as root on a fresh Ubuntu 24.04 LXC.

bash
#!/bin/bash
set -e

apt update
apt install -y curl gpg apt-transport-https

# Node.js 22
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
apt install -y nodejs

# Syncthing
curl -fsSL https://syncthing.net/release-key.gpg | \
  gpg --dearmor -o /usr/share/keyrings/syncthing-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/syncthing-archive-keyring.gpg] \
  https://apt.syncthing.net/ syncthing stable" \
  > /etc/apt/sources.list.d/syncthing.list
apt update && apt install -y syncthing

# Vault directory
mkdir -p /vault

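# Syncthing: order the unit after network-online.target via a drop-in. This prevents
# a race where the service starts before the IP is assigned and the first vault
# sync fails on boot.
mkdir -p /etc/systemd/system/syncthing@root.service.d
cat > /etc/systemd/system/syncthing@root.service.d/override.conf << 'EOF'
[Unit]
After=network-online.target
Wants=network-online.target
EOF
systemctl daemon-reload
systemctl enable syncthing@root
systemctl start syncthing@root
sleep 8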

# Expose Syncthing GUI on all interfaces
CONFIG_PATH=$(find /root -name "config.xml" 2>/dev/null | grep syncthing | head -1)
sed -i 's|<address>127.0.0.1:8384</address>|<address>0.0.0.0:8384</address>|' "$CONFIG_PATH"
systemctl restart syncthing@root

# @modelcontextprotocol/server-filesystem + supergateway
npm install -g @modelcontextprotocol/server-filesystem supergateway

# systemd service — streamableHttp transport, stateless (no --stateful), protocol 2024-11-05
cat > /etc/systemd/system/obsidian-mcp.service << 'EOF'
[Unit]
Description=Obsidian MCP Server (server-filesystem)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=root
ExecStart=supergateway \
  --stdio "npx -y @modelcontextprotocol/server-filesystem /vault" \
  --port 3000 \
  --outputTransport streamableHttp
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable obsidian-mcp
systemctl start obsidian-mcp

After the LXC is running

  • Configure Syncthing on the LXC (port 8384) to accept a share from your Mac/Android
  • Set the shared folder path to /vault
  • Connect Mac Obsidian Syncthing client to the LXC Syncthing device

LiteLLM connection

Add the server to litellm-config.yaml at the top level (not nested under litellm_settings). Set protocol_version explicitly: server-filesystem ignores 2025-11-25, and without an explicit value the handshake fails silently:

yaml
mcp_servers:
  - name: obsidian
    url: http://YOUR_LXC_IP:3000/mcp
    transport: streamableHttp
    protocol_version: "2024-11-05"
The server is stateless — no mcp-session-id header is needed. Clients must send Accept: application/json, text/event-stream — missing this returns -32000 Not Acceptable. LiteLLM sends the correct header automatically when transport: streamableHttp is set.
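Verify: POST a JSON-RPC initialize request to http://YOUR_LXC_IP:3000/mcp with curl, sending -H "Accept: application/json, text/event-stream" and -H "Content-Type: application/json". Expected: a JSON-RPC result for the initialize call (a bare GET carries no request body and will not return one). A -32000 Not Acceptable error means the Accept header is missing.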
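A minimal sketch of that manual handshake check, assuming the endpoint and protocol version from the config above. The MCP_URL variable, the function names, and the clientInfo values are placeholders introduced for this example, not part of the original setup.

```bash
# Sketch: hand-rolled MCP initialize request against the supergateway endpoint.
# MCP_URL and the clientInfo values are placeholders for illustration.
MCP_URL="${MCP_URL:-http://YOUR_LXC_IP:3000/mcp}"

# Print the JSON-RPC initialize body, pinned to the protocol version that
# litellm-config.yaml sets for this server.
mcp_initialize_payload() {
  printf '%s' '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"manual-check","version":"0.0.1"}}}'
}

# POST the body with the Accept and Content-Type headers set explicitly.
mcp_initialize() {
  curl -s -X POST "$MCP_URL" \
    -H "Content-Type: application/json" \
    -H "Accept: application/json, text/event-stream" \
    -d "$(mcp_initialize_payload)"
}

# Usage:
#   MCP_URL=http://YOUR_LXC_IP:3000/mcp mcp_initialize
```

Dropping the Accept header from mcp_initialize should reproduce the -32000 Not Acceptable error described in the known issues.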
OBSIDIAN VAULT SYNC + MCP

Known issues

1 — -32000 Not Acceptable on MCP POST requests
Symptom: POSTs to /mcp return {"error": {"code": -32000, "message": "Not Acceptable"}}.
Fix: The request is missing the Accept: application/json, text/event-stream header. LiteLLM sends it automatically, so this error usually means a client (curl test, custom integration) is omitting it. Add -H "Accept: application/json, text/event-stream" to any manual curl calls.
2 — Protocol version mismatch / silent handshake failure
Symptom: MCP server connects but tool calls return no results or the session never initializes.
Fix: Set protocol_version: "2024-11-05" in the LiteLLM MCP server config. @modelcontextprotocol/server-filesystem does not respond correctly to 2025-11-25; the handshake succeeds at the transport level but tool listing returns empty.
3 — Syncthing fails on boot (IP not assigned in time)
Symptom: Vault sync works when Syncthing is started manually but fails after reboot.
Fix: The default After=network.target races IP assignment on boot. The unit needs After=network-online.target and Wants=network-online.target. Verify both are present in the running unit: systemctl cat syncthing@root | grep -E "After|Wants". If they are missing, add them with a drop-in override (systemctl edit syncthing@root) rather than editing the packaged unit file.
4 — SSE transport (--outputTransport sse) breaks POST requests
Fix: Omitting --outputTransport streamableHttp defaults supergateway to SSE, which does not handle /mcp POSTs. Always specify --outputTransport streamableHttp explicitly.
5 — LXC disk corruption on network-backed storage
Fix: Storing the LXC root disk on a network share (e.g. CIFS/SMB) caused EXT4 I/O errors and disk corruption under write load. Use local storage or a reliable block storage backend for LXC root disks.
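
For issue 3, a drop-in override could be written along these lines. This is a sketch, not the original setup script: the function name, the override.conf file name, and the optional base-directory argument are illustrative choices, and only the two [Unit] directives come from the fix described above.

```bash
# Sketch: write a drop-in that orders a unit after network-online.target.
# $1 = unit name without the .service suffix (e.g. syncthing@root)
# $2 = base directory, defaulting to /etc/systemd/system
write_network_online_dropin() {
  unit="$1"
  base="${2:-/etc/systemd/system}"
  dir="$base/$unit.service.d"
  mkdir -p "$dir" || return 1
  printf '%s\n' \
    '[Unit]' \
    'After=network-online.target' \
    'Wants=network-online.target' \
    > "$dir/override.conf"
  # Print the path so the caller can inspect the result
  echo "$dir/override.conf"
}

# Usage (as root):
#   write_network_online_dropin syncthing@root
#   systemctl daemon-reload && systemctl restart syncthing@root
#   systemctl cat syncthing@root | grep -E "After|Wants"
```

A drop-in adds to the packaged unit's ordering rather than replacing it, so it survives package upgrades that rewrite the original unit file.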
BRAVE SEARCH MCP

Web Search Tool Use

Adds live web search as a callable tool — the AI model can run Brave Search queries during a conversation in response to tool calls from Open WebUI and other clients. LiteLLM spawns the server on demand via stdio using an API key you supply.

Step 1 — Get a Brave Search API key

Go to api.search.brave.com, create a free account, and generate an API key under the Data for AI plan (free tier supports up to 2,000 queries/month).

Step 2 — Confirm Node.js is installed system-wide

The MCP server is launched via npx. If you completed the Playwright MCP setup, Node.js is already installed system-wide and this step is done. Otherwise:

bash
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
Verify: which npx — expected output: /usr/bin/npx

Step 3 — Add Brave Search MCP server in LiteLLM UI

Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui → MCP Servers → Add New MCP Server. (LiteLLM lives on spark-01.)

Field | Value
Name | brave-search
Alias | brave-search
Transport Type | Standard Input/Output (stdio)

Set Stdio Configuration (JSON) — replace YOUR_BRAVE_API_KEY with your actual key:

json
{
  "command": "npx",
  "args": [
    "-y",
    "@modelcontextprotocol/server-brave-search"
  ],
  "env": {
    "BRAVE_API_KEY": "YOUR_BRAVE_API_KEY"
  }
}

Save and confirm Health Status shows Healthy.

If the health check fails, verify that npx resolves to /usr/bin/npx (system-wide install) and not a user-local path. See the Known Issues section — MCP stdio servers fail health check — for the full diagnosis.
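
To separate a LiteLLM problem from a server problem, the stdio server can also be started by hand. The sketch below pipes a single initialize request into whatever command it is given and prints the first line written to stdout. The function name, the 30-second timeout, the clientInfo values, and the choice of protocol version 2024-11-05 are assumptions made for this example.

```bash
# Sketch: probe a stdio MCP server with one newline-delimited JSON-RPC
# initialize request and show the first line of its reply.
stdio_mcp_probe() {
  printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"manual-check","version":"0.0.1"}}}' \
    | timeout 30 "$@" 2>/dev/null \
    | head -n 1
}

# Usage:
#   export BRAVE_API_KEY=YOUR_BRAVE_API_KEY
#   stdio_mcp_probe npx -y @modelcontextprotocol/server-brave-search
```

If this prints a JSON-RPC result when run from a login shell but the LiteLLM health check still fails, the difference is most likely the environment the service runs in (PATH or the API key), not the server package.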

Validation

In Open WebUI, send the following prompt:

Use the brave-search tool to find the latest news about NVIDIA and summarize the top three results.

Expected: the model calls the brave_web_search tool (shown as "Explored" in Open WebUI) and returns a summary drawn from live search results.

PLAYWRIGHT MCP

Browser Automation

Adds browser automation tool use to the stack — the AI model can navigate pages, take screenshots, and scrape content via tool calls in Open WebUI and other clients. LiteLLM spawns a headless Chromium process on demand via stdio; no persistent port is required.

Step 1 — Install Node.js system-wide

LiteLLM runs as a systemd service and does not source .bashrc. Node.js must be installed system-wide so npx is available in LiteLLM's PATH.

bash
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
Verify: which npx — expected output: /usr/bin/npx
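If Node.js was previously installed via nvm or a user-local installer, that copy stays in place; this step adds a system-wide install alongside it, and /usr/bin/npx is the one LiteLLM's systemd service will find. In a shell where nvm is loaded, which npx may still print the user-local path, so also confirm that /usr/bin/npx exists. The Known Issues section documents the npx PATH problem in detail.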
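
The check can be scripted. This is a small sketch that treats only /usr/bin as system-wide, following the expected which npx output above; the function name is made up for this example.

```bash
# Sketch: classify an npx path as system-wide (/usr/bin) or anything else.
npx_is_system_wide() {
  case "$1" in
    /usr/bin/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if npx_is_system_wide "$(command -v npx)"; then
#     echo "npx is system-wide"
#   else
#     echo "npx resolves to a user-local path; LiteLLM may not find it"
#   fi
```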

Step 2 — Install Playwright MCP Chromium browser

Chrome has no ARM64 build. Use Chromium, installed via the @playwright/mcp package's own browser installer — not via npx playwright install:

bash
npx @playwright/mcp install-browser chromium
Verify: ls ~/.cache/ms-playwright/ — expected: a chromium-XXXX directory is present.

Step 3 — Update litellm-config.yaml

Add model_info blocks to all model entries. Without these, LiteLLM does not advertise function calling support and tool calls will not execute.

yaml
model_list:
  - model_name: qwen3.5-122b
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
    model_info:
      supports_function_calling: true
      supports_tool_choice: true

Restart LiteLLM after saving:

bash
sudo systemctl restart litellm
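
After the restart, a direct request can confirm that tool definitions pass through the proxy. The sketch below builds an OpenAI-style chat request that offers one tool. The get_time tool, the LITELLM_URL and LITELLM_KEY variables, and the function names are hypothetical placeholders; the model name and port 8001 come from this guide.

```bash
# Sketch: send a chat request with one tool attached to the LiteLLM proxy.
LITELLM_URL="${LITELLM_URL:-http://localhost:8001}"

# Print the request body for the given model name.
tool_call_payload() {
  cat <<EOF
{
  "model": "$1",
  "messages": [{"role": "user", "content": "What time is it? Use the tool."}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_time",
      "description": "Return the current time",
      "parameters": {"type": "object", "properties": {}}
    }
  }],
  "tool_choice": "auto"
}
EOF
}

# POST the body; LITELLM_KEY must hold a valid key for the proxy.
tool_call_check() {
  curl -s "$LITELLM_URL/v1/chat/completions" \
    -H "Authorization: Bearer $LITELLM_KEY" \
    -H "Content-Type: application/json" \
    -d "$(tool_call_payload qwen3.5-122b)"
}

# Usage:
#   LITELLM_KEY=sk-yourkey tool_call_check
```

A response whose message contains a tool_calls entry suggests function calling is wired through end to end; a plain text answer points back at the model_info block or at the backend's tool-calling settings.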

Step 4 — Add Playwright MCP server in LiteLLM UI

Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui → MCP Servers → Add New MCP Server. (LiteLLM lives on spark-01.)

Field | Value
Name | playwright
Alias | playwright
Transport Type | Standard Input/Output (stdio)

Set Stdio Configuration (JSON):

json
{
  "command": "npx",
  "args": [
    "-y",
    "@playwright/mcp@latest",
    "--browser",
    "chromium",
    "--headless"
  ]
}

Save and confirm Health Status shows Healthy.

Validation

In Open WebUI, send the following prompt:

Use the playwright tool to browse to cnn.com, take a screenshot, and tell me what you see.

Expected: the model calls the Playwright navigate and screenshot tools (named browser_navigate and browser_take_screenshot in current @playwright/mcp releases; shown as "Explored" in Open WebUI) and returns a summary of the page.

HOME ASSISTANT MCP

Home Assistant tool integration

Home Assistant exposes its own MCP endpoint at /api/mcp using Streamable HTTP transport. This lets Hermes (via LiteLLM) call HA as a tool — controlling devices, querying states, reading automations.

Step 1 — Enable the integration in Home Assistant

Settings → Devices & Services → Add Integration → search "Model Context Protocol Server"

Field | Value
Integration | Model Context Protocol Server
Endpoint path | /api/mcp (built into HA; no extra server needed)
Transport | Streamable HTTP

Step 2 — Generate a long-lived access token

In Home Assistant: Profile → Long-Lived Access Tokens → Create Token. Copy the full token — it is only shown once.

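When configuring the MCP server through the LiteLLM UI, pass the token as the API_ACCESS_TOKEN environment variable rather than pasting it inline as a bearer token header; inline tokens get truncated by the UI field length limit. The YAML config in Step 3 is read from the file and does not pass through that UI field.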

Step 3 — Add to LiteLLM MCP config

yaml · litellm-config.yaml
mcp_servers:
  - name: home-assistant
    url: http://YOUR_HA_IP:8123/api/mcp
    transport: streamableHttp
    headers:
      Authorization: "Bearer YOUR_HA_LONG_LIVED_TOKEN"
Verify: in Open WebUI or Hermes, ask the model to list your Home Assistant entities. Expected: device list returned as a tool call result.
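
If the tool call fails, it helps to test the token against Home Assistant directly before suspecting LiteLLM. The sketch below requests the REST API root at /api/ and interprets the status code. The function names are made up for this example, and reading the token from API_ACCESS_TOKEN is an assumption carried over from Step 2.

```bash
# Sketch: check a long-lived token against the Home Assistant REST API.
# $1 = base URL (e.g. http://YOUR_HA_IP:8123), $2 = token
ha_api_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $2" "$1/api/"
}

# Turn the status code into a short diagnosis.
explain_ha_status() {
  case "$1" in
    200) echo "token accepted" ;;
    401) echo "token rejected (check for truncation or a revoked token)" ;;
    000) echo "Home Assistant unreachable" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage:
#   explain_ha_status "$(ha_api_status http://YOUR_HA_IP:8123 "$API_ACCESS_TOKEN")"
```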
TROUBLESHOOTING

LiteLLM Admin UI

The LiteLLM proxy ships with a built-in web UI at /ui. It requires a master key and a PostgreSQL database — SQLite is not supported for the UI auth layer. The following documents every error encountered during setup, in order.

Step 1 — Set a master key

All commands below run on spark-01 (where LiteLLM lives). Add to ~/spark-ai-stack/litellm-config.yaml:

yaml
general_settings:
  master_key: sk-yourkey
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"

Generate a secure key:

bash
echo "sk-$(openssl rand -hex 16)"

Step 2 — Add PostgreSQL to a compose file on spark-01

Run Postgres on the same node as LiteLLM. Putting it on spark-02 would re-introduce the very latency-disturbance pattern this architecture is designed to avoid. Create a new ~/spark-ai-stack/litellm-db.yml on spark-01:

yaml
services:
  litellm-db:
    image: postgres:16
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    ports:
      - "5432:5432"
    volumes:
      - litellm_db:/var/lib/postgresql/data

volumes:
  litellm_db:
bash
cd ~/spark-ai-stack
docker compose -f litellm-db.yml up -d litellm-db
restart: unless-stopped combined with sudo systemctl enable docker ensures the container survives reboots automatically — no additional systemd unit needed.
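
Postgres takes a few seconds to accept connections after the container starts. A small retry helper such as the one below can gate the later Prisma steps on the database being ready. The helper name and the one-second interval are illustrative; pg_isready is the readiness tool shipped in the postgres image.

```bash
# Sketch: retry a command up to N times, one second apart.
# $1 = number of attempts, remaining arguments = command to run
wait_until() {
  tries="$1"
  shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Succeed as soon as the command exits 0
    "$@" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_until 30 docker exec litellm-db pg_isready -U litellm -d litellm \
#     && echo "database ready"
```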

Step 3 — Install Prisma

LiteLLM uses Prisma as its database ORM. It is not included in the base pip install litellm package.

bash
pip install prisma --break-system-packages

--break-system-packages bypasses the "externally managed environment" guard (PEP 668) that Ubuntu 24.04 applies to its system Python, which otherwise stops pip from installing outside a virtual environment. It is safe on a dedicated AI server where system tools do not depend on conflicting packages.

Step 4 — Generate Prisma binaries

After installing the package, the binaries must be generated from LiteLLM's bundled schema:

bash
cd ~/.local/lib/python3.12/site-packages/litellm/proxy
prisma generate --schema schema.prisma

Step 5 — Apply the database schema

The Postgres database exists but has no tables yet. Push the schema from the same directory as Step 4. DATABASE_URL must be passed inline because Prisma reads it directly from the environment, not from litellm-config.yaml.

bash
cd ~/.local/lib/python3.12/site-packages/litellm/proxy
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
prisma db push --schema schema.prisma

Step 6 — Restart LiteLLM

bash
sudo systemctl daemon-reload
sudo systemctl restart litellm
sudo systemctl status litellm

Errors encountered in order

Error | Cause | Fix
Authentication Error, Not connected to DB | No PostgreSQL configured | Add database_url to general_settings
ModuleNotFoundError: No module named 'prisma' | Prisma not installed | pip install prisma --break-system-packages
Unable to find Prisma binaries | prisma generate not run | Run prisma generate --schema schema.prisma
The table 'public.LiteLLM_UserTable' does not exist | Schema not applied to DB | Run prisma db push --schema schema.prisma
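
The table can be turned into a quick log scanner. In this sketch the function name is made up, while the match strings and fixes are taken from the table above; it assumes LiteLLM logs to the journal under the litellm unit used in Step 6.

```bash
# Sketch: map one log line to the fix listed in the error table.
suggest_fix() {
  case "$1" in
    *"Not connected to DB"*)
      echo "Add database_url to general_settings" ;;
    *"No module named 'prisma'"*)
      echo "pip install prisma --break-system-packages" ;;
    *"Unable to find Prisma binaries"*)
      echo "Run prisma generate --schema schema.prisma" ;;
    *"LiteLLM_UserTable"*"does not exist"*)
      echo "Run prisma db push --schema schema.prisma" ;;
    *)
      return 1 ;;  # no match: print nothing
  esac
}

# Usage: scan recent service logs and print each suggested fix once.
#   journalctl -u litellm -n 200 --no-pager \
#     | while IFS= read -r line; do suggest_fix "$line"; done | sort -u
```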

Accessing the UI

Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui. Username: admin. Password: your master_key value.

Expected: the UI loads and accepts a login with the master key credentials.

CHANGELOG

Project milestones

Date | Milestone
May 1, 2026 | DGX Spark acquisition; Project Jiffy initiated
May 9–13 | Two-node clustering, 200 GbE/RoCE networking, vLLM bring-up
May 13 | INT4 → FP8 model migration; NCCL/RoCE fix → 18× throughput improvement (2–3 tok/s → 45 tok/s)
May 15 | MTP speculative decoding removed: unstable in vLLM v0.19.0, HTTP 500 on standard sampling params
May 16 | Hermes Agent deployed; Telegram gateway live
May 24–29 | n8n trading bot and market research workflows built
June 2 | Hermes desktop app v0.15.2 released
June 8 | Desktop app integration; dashboard basic auth; Home Assistant MCP confirmed working
June 11 | Obsidian MCP: obsidian-mcp → @modelcontextprotocol/server-filesystem migration
June 12 | Hermes memory limits raised (memory_char_limit 6000, user_char_limit 3000); dashboard auto-restart via ExecStartPost + sudoers
June 15 | Power outage recovery: orphaned vllm-qwen container (restart:always on port 8000) identified and removed
NVIDIA DGX: At-Home AI Stack · split-trust shared compute · arm64 · Ubuntu 24.04 · v3.0 vLLM TP=2 · Ray · 2× LiteLLM · 2× Open WebUI · 2× n8n · 2× Tailscale · Hermes Agent