Deploying Hermes Agent with a Local LLM on Leafcloud: Sovereign AI on European GPUs

There’s a category of AI agent emerging that’s qualitatively different from the chatbots and coding copilots that came before. They have persistent memory. They run continuously, not just when you open a tab. They write their own skills as they encounter problems, and remember those skills the next time. They live on a server somewhere, and you reach them via Telegram, Discord, Slack, or whatever messaging app you actually use.

Hermes Agent — built by Nous Research, MIT licensed, released earlier this year — is one of the cleanest implementations of this pattern. It’s a Python application that installs in one command, configures via a wizard, runs as a systemd service, and accumulates a model of you and your projects across every conversation.

That accumulated model is exactly why it matters where you host this thing. A workflow tool like n8n holds your credentials. An agent like Hermes holds something stranger and more personal: a working memory of your projects, your preferences, your past decisions, your half-finished ideas. Running that on infrastructure you don’t control is a different kind of trust exercise than running a static website.

There’s a second-order question that often gets skipped: even if Hermes itself runs in Europe on infrastructure you trust, every prompt it sends to OpenAI or Anthropic still leaves the continent. The sovereignty story is only half true if the model is hosted elsewhere. The other half lives on the GPU.

This guide deploys the whole stack on Leafcloud: Hermes on an Ubuntu VM, an open-weight Nous model running locally via vLLM on an attached RTX 6000 Blackwell, and the messaging gateway connecting your phone to your own private LLM. About forty minutes, end to end. No prompts leave the building.

Why run the full Hermes stack on Leafcloud

Sovereignty, properly. A persistent-memory agent calling a US-hosted model still ships every thought you’ve ever had through that model to the US. Local inference on EU infrastructure closes the loop — your data, your inference, your jurisdiction.

The economics of a Blackwell GPU. Renting top-end silicon by the hour from a Big Tech hyperscaler to run an 8B model wastes most of the card. Hermes is a low-throughput workload — one user, intermittent inference — so a single RTX 6000 Blackwell has plenty of headroom for the agent, the model, and a generous KV cache. The same GPU can serve other workloads alongside it. Our RTX 6000 Blackwell pricing starts at €2.35/hour committed.


Heat reuse, not greenwashing. Our compute lives inside real buildings (apartment blocks, offices, swimming pools), where the waste heat from the GPU feeds the building’s heating loop. An RTX 6000 Blackwell under load draws around 300 W. Most of that ends up in hot water for the people upstairs instead of being blown into the sky.

Migration from OpenClaw. If you’ve already deployed OpenClaw on Leafcloud, Hermes ships with a tool that imports your memories, skills, command allowlists, API keys, and messaging configs in one command. The transition is painless.

What you’ll end up with

A single Ubuntu 22.04 VM with one RTX 6000 Blackwell attached (96 GB GDDR7 ECC, 5th-gen Tensor Cores), running:

Hermes installed natively under the ubuntu user via the official install script
vLLM in Docker, serving an open-weight Nous model on localhost:8000
All Hermes state (~/.hermes/) on a persistent block volume, surviving VM rebuilds
The messaging gateway running as a systemd service, autostarting on boot
A connection to Telegram (or Discord, Slack, WhatsApp, Signal — your pick) so you can reach the agent from your phone
Daily backups to Leafcloud object storage covering memory, skills, and OAuth tokens

No public ports beyond SSH. Hermes connects outbound to the messaging platforms. The model only listens on localhost. For inference, the agent only ever talks to your own GPU.

1. Provision the GPU VM

In the Leafcloud dashboard, launch a new instance:

Setting	Value
Image	`Ubuntu 22.04 LTS`
Flavor	Blackwell Pro (1× RTX 6000 Blackwell, 96 GB VRAM, 32 vCPU, 256 GB RAM, 2 TB NVMe)
Root volume	100 GB — model weights and Docker layers add up
Key pair	Your SSH key

Security group needs exactly one rule:

22/tcp — SSH, restricted to your IP range

Hermes doesn’t accept inbound connections, and the model only listens on localhost. You talk to your agent through Telegram, not through a public endpoint you have to defend.

Allocate a floating IP and SSH in as ubuntu.

2. Install the NVIDIA driver

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential
# Blackwell cards need the R570 driver branch or newer, with the open kernel modules
sudo apt install -y nvidia-driver-570-open nvidia-utils-570
sudo reboot

Reconnect and confirm:

nvidia-smi

You should see the RTX 6000 Blackwell listed with the driver version and CUDA runtime.
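If you want that check to be scriptable rather than visual, a small comparison on the driver's major version is enough. The sketch below is an illustration only: driver_at_least is a hypothetical helper name, and the minimum you pass in is your own choice based on NVIDIA's release notes for your card.

```shell
# Hypothetical helper: succeed if a driver version string such as "570.133.07"
# has a major version at or above the minimum passed as the second argument.
driver_at_least() {
  version="$1"
  minimum="$2"
  major="${version%%.*}"          # keep everything before the first dot
  [ "$major" -ge "$minimum" ]
}

# Usage on the VM (not run here); MIN_MAJOR is whatever branch your card needs:
#   v="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   driver_at_least "$v" "$MIN_MAJOR" && echo "driver $v is new enough"
```

The helper only parses a string, so it can be tried on any machine before it is wired up to nvidia-smi.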

3. Install Docker and the NVIDIA Container Toolkit

Docker:

curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
newgrp docker

Then the Container Toolkit so containers can see the GPU:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Smoke test:

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If you see the RTX 6000 Blackwell from inside the container, the plumbing works.

4. Attach a persistent block volume

Two things need to live somewhere persistent: Hermes’s memory directory (~/.hermes/) and the model weights you’re about to download. Both go on a block volume so they survive container, image, and VM rebuilds.

Attach a 200 GB block volume to the instance from the dashboard. It typically appears as /dev/vdb; confirm with lsblk before formatting, because mkfs erases whatever device you point it at. Then on the VM:

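sudo mkfs.ext4 /dev/vdb
sudo mkdir -p /var/hermes-data
# Mount by UUID so the entry survives device renaming; nofail keeps the VM bootable if the volume is detached
echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /var/hermes-data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
sudo mount -a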

sudo mkdir -p /var/hermes-data/hermes /var/hermes-data/models
sudo chown -R ubuntu:ubuntu /var/hermes-data
ln -s /var/hermes-data/hermes ~/.hermes

The symlink means Hermes still uses its conventional path; the actual data lives on the volume.
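It is worth confirming that the symlink really lands on the volume before Hermes writes anything. The following is a minimal sketch, assuming GNU readlink is available (it is on Ubuntu); resolves_under is a hypothetical helper name, not part of Hermes.

```shell
# Hypothetical helper: succeed if a path, after resolving symlinks, lives
# at or under the given directory.
resolves_under() {
  target="$(readlink -f "$1")" || return 1
  root="$(readlink -f "$2")" || return 1
  case "$target/" in
    "$root"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the VM (not run here):
#   mountpoint -q /var/hermes-data && resolves_under ~/.hermes /var/hermes-data \
#     && echo "Hermes state is on the block volume"
```

Pairing it with mountpoint matters: if the volume failed to mount, the directory still exists on the root disk and the symlink alone would look fine.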

5. Run vLLM with an open-weight Nous model

Hermes is built by Nous, and Nous publishes open-weight models tuned for exactly this kind of agentic, tool-calling workload. The sensible defaults:

NousResearch/Hermes-3-Llama-3.1-8B to get started — fast, fits with room to spare, good for iteration.
NousResearch/Hermes-3-Llama-3.1-70B in AWQ 4-bit quantisation if you want significantly higher quality. The 96 GB GDDR7 on the RTX 6000 Blackwell leaves substantial headroom for KV cache and longer contexts — you can even step up to 8-bit quantisation for a quality bump if you want to spend the VRAM.

Start with the 8B. You can swap to the 70B later by changing one argument.
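The sizing behind those choices is simple arithmetic: weight footprint is roughly parameters times bits per weight, divided by eight. The sketch below is a back-of-envelope estimate only; weights_gb is a hypothetical helper, and it ignores KV cache, activations, and quantisation overhead.

```shell
# Rough weight footprint in GB: parameters (in billions) x bits per weight / 8.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b / 8 }'
}

weights_gb 8 16    # 8B at 16-bit  -> 16
weights_gb 70 4    # 70B at 4-bit  -> 35
weights_gb 70 8    # 70B at 8-bit  -> 70
```

Against 96 GB of VRAM that leaves roughly 80, 61, and 26 GB respectively for KV cache and runtime overhead, which is why the 8-bit 70B is possible but tighter.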

Launch vLLM in Docker:

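docker run -d --restart unless-stopped \
  --name vllm \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8000:8000 \
  -v /var/hermes-data/models:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --served-model-name hermes-3 \
  --max-model-len 16384 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

A few things worth pointing out in that command:

-p 127.0.0.1:8000:8000 publishes the port on the VM’s loopback interface only. The model is never reachable from outside the VM.
--ipc=host gives the container access to the host’s shared memory, which vLLM’s PyTorch processes use to exchange data. (--shm-size is the alternative if you’d rather not share the host IPC namespace.)
--enable-auto-tool-choice and --tool-call-parser hermes let vLLM turn the model’s tool-call output into structured OpenAI-style tool calls, which an agent relies on for tool use.
The HuggingFace cache lives on the persistent volume, so the model survives container rebuilds and you don’t re-download 16 GB every time.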
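The localhost-only binding can be verified rather than trusted. The sketch below relies on docker port, which prints the host address each container port is published on; published_on_loopback_only is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if every host binding for the given
# container port is on the IPv4 loopback address.
published_on_loopback_only() {
  bindings="$(docker port "$1" "$2")" || return 1
  [ -n "$bindings" ] || return 1
  # Any line that does not start with 127.0.0.1: is a non-loopback binding.
  ! printf '%s\n' "$bindings" | grep -qv '^127\.0\.0\.1:'
}

# Usage on the VM (not run here):
#   published_on_loopback_only vllm 8000/tcp && echo "vLLM is not exposed"
```

A container started with a plain -p 8000:8000 would report 0.0.0.0:8000 instead and fail this check, which is the mistake worth catching.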

First boot downloads the weights and warms up; tail the logs to watch:

docker logs -f vllm

You’re waiting for Uvicorn running on http://0.0.0.0:8000. That address is the one inside the container; on the VM, the port mapping restricts it to localhost.
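If you would rather script the wait than watch logs, polling the API works too. This is a minimal sketch: wait_for_url is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```shell
# Hypothetical helper: poll a URL until curl gets a response without an HTTP
# error, or give up after a fixed number of attempts.
wait_for_url() {
  url="$1"
  attempts="${2:-60}"
  delay="${3:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage on the VM (not run here):
#   wait_for_url http://localhost:8000/v1/models 120 5 && echo "vLLM is ready"
```

The first launch includes the weight download, so a generous attempt count is the safer choice there.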

Test it directly:

curl http://localhost:8000/v1/models

You should see hermes-3 in the response.
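Listing models proves the server is up, but not that it can generate. A single request to the OpenAI-compatible chat endpoint covers that. The sketch below is an illustration: chat_once is a hypothetical helper, and the prompt and max_tokens value are arbitrary.

```shell
# Hypothetical helper: send one fixed chat message to an OpenAI-compatible
# endpoint and print the raw JSON response.
chat_once() {
  base_url="$1"
  model="$2"
  curl -fsS "$base_url/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"Reply with the single word: ready\"}], \"max_tokens\": 16}"
}

# Usage on the VM (not run here):
#   chat_once http://localhost:8000/v1 hermes-3
```

The base URL and model name are the same two values the Hermes setup wizard asks for, so a working response here means the wizard should connect as well.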

6. Install Hermes Agent

The install script handles everything — uv, Python 3.11, Node.js, ripgrep, ffmpeg — no sudo required:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
source ~/.bashrc

Verify:

hermes --version
hermes doctor

7. Point Hermes at your local model

Run the setup wizard:

hermes setup

When it asks for an LLM provider, choose “Custom OpenAI-compatible endpoint” (or equivalent — wording varies by version) and enter:

Base URL: http://localhost:8000/v1
Model name: hermes-3
API key: anything non-empty (vLLM doesn’t check by default; set one if you want via --api-key on the vLLM container)

The wizard also walks you through personality, messaging gateway, and OpenClaw migration if you have one to import. Skip what you don’t need now — everything is reconfigurable later via hermes model, hermes gateway setup, etc.

Test the connection on the CLI:

hermes

Type something. The first response will take a moment as vLLM warms up the request path, then subsequent responses should be quick. Ctrl+C to exit.

8. Set up the messaging gateway

If you skipped the gateway during setup:

hermes gateway setup

For Telegram (the most common pick), this walks you through creating a bot via BotFather, pairing your account, and setting the allowed user list. Take your time on the allowed-users step — this is your “only I can talk to my agent” check.

Test in the foreground first:

hermes gateway

Send your bot a message from your phone. It should respond, going round-trip through Telegram → your VM → your RTX 6000 Blackwell → back. Ctrl+C to stop.

9. Install the gateway as a systemd service

Make it permanent:

hermes gateway install

This creates a user systemd unit that starts the gateway on boot, restarts on failure, and logs to the journal:

systemctl --user status hermes-gateway
journalctl --user -u hermes-gateway -f

Enable user-service lingering so it actually starts on boot rather than only when you log in:

sudo loginctl enable-linger ubuntu

Reboot the VM and confirm both the gateway and vLLM come back without manual intervention. Send a Telegram message to check end to end.
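The post-reboot check can be collapsed into one function that reports the first thing that is down. This is a sketch under the assumption that the unit and container names match the ones used in this guide; stack_healthy is a hypothetical helper.

```shell
# Hypothetical post-reboot check: gateway unit, vLLM container, then the API.
stack_healthy() {
  systemctl --user is-active --quiet hermes-gateway \
    || { echo "gateway not active"; return 1; }
  [ "$(docker inspect -f '{{.State.Running}}' vllm 2>/dev/null)" = "true" ] \
    || { echo "vllm container not running"; return 1; }
  curl -fsS -o /dev/null http://localhost:8000/v1/models \
    || { echo "vllm API not answering"; return 1; }
  echo "stack healthy"
}

# Usage on the VM, as the ubuntu user (not run here):
#   stack_healthy
```

Expect the API check to fail for a short while after boot, since the container starts before the model has finished loading.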

10. Back up `~/.hermes` to Leafcloud object storage

A daily restic job covers memory, skills, and OAuth tokens. Models live in /var/hermes-data/models and are re-downloadable from HuggingFace, so they don’t need to be in the backup set.

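#!/bin/bash
# /usr/local/bin/hermes-backup.sh
# Daily restic backup of Hermes state to S3-compatible object storage.
set -euo pipefail

export RESTIC_REPOSITORY=s3:https://leafcloud.store/hermes-backups
export RESTIC_PASSWORD_FILE=/etc/restic/password
# Replace the placeholders; they are quoted so the angle brackets are not parsed as redirections
export AWS_ACCESS_KEY_ID="<object-storage-access-key>"
export AWS_SECRET_ACCESS_KEY="<object-storage-secret-key>"

# Snapshot the Hermes state directory, then apply the retention policy
restic backup /var/hermes-data/hermes
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune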

Drop it in cron at 3 AM and you’re done.
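One way to do that is a cron.d entry, sketched below. The repository needs a one-time restic init, run with the same environment variables exported, before the first backup. The helper name write_backup_cron and the log path are assumptions for illustration.

```shell
# Hypothetical helper: write a cron.d entry that runs a script daily at a
# given hour as root. The log path is an assumption for illustration.
write_backup_cron() {
  cron_file="$1"
  hour="$2"
  script="$3"
  printf '0 %s * * * root %s >> /var/log/hermes-backup.log 2>&1\n' \
    "$hour" "$script" > "$cron_file"
}

# Usage on the VM, as root (not run here):
#   chmod 700 /usr/local/bin/hermes-backup.sh
#   write_backup_cron /etc/cron.d/hermes-backup 3 /usr/local/bin/hermes-backup.sh
```

The chmod matters because the script holds object-storage credentials. Running restic snapshots after the first night is a quick way to confirm the job actually fired.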

11. Keeping the stack up to date

# Update Hermes
hermes update

# Update vLLM
docker pull vllm/vllm-openai:latest
docker stop vllm && docker rm vllm
# (re-run the vLLM launch command from step 5)

To swap models (to the 70B, for example), stop and remove the vLLM container, then relaunch with --model NousResearch/Hermes-3-Llama-3.1-70B-AWQ and add --quantization awq. Confirm the exact repository name on Hugging Face before pulling. Hermes doesn’t need to know; the served model name stays hermes-3.
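Since both updating and swapping mean re-running the launch command, it can help to keep that command in a function where the model is the only thing that changes. The wrapper below is a sketch that mirrors the step 5 flags; launch_vllm is a hypothetical name, and any extra arguments are passed straight through to vLLM.

```shell
# Hypothetical wrapper around the step 5 launch command. The model is the
# first argument; anything after it is appended to the vLLM arguments.
launch_vllm() {
  model="$1"
  shift
  docker run -d --restart unless-stopped \
    --name vllm \
    --gpus all \
    --ipc=host \
    -p 127.0.0.1:8000:8000 \
    -v /var/hermes-data/models:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model "$model" \
    --served-model-name hermes-3 \
    --max-model-len 16384 \
    "$@"
}

# Usage on the VM (not run here):
#   docker stop vllm && docker rm vllm
#   launch_vllm NousResearch/Hermes-3-Llama-3.1-8B
```

For a quantised model, the extra flags go after the model name, for example --quantization awq. Because the served model name is fixed inside the function, the Hermes configuration stays untouched across swaps.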

What you’ve actually deployed

A persistent-memory AI agent and the model it runs on, both hosted on European infrastructure and isolated to a single VM with no public attack surface beyond SSH, with the agent’s state backed up to object storage. No prompt goes to a third-party model provider. The waste heat from the GPU goes into a real building’s heating loop instead of a hyperscaler’s chiller.

If you outgrow a single GPU — multiple users, larger models, parallel sub-agents pushing throughput — the next steps are either the Blackwell Duo Pro (2× RTX 6000) or Blackwell Quad Pro (4× RTX 6000) for up to 384 GB of total VRAM, or Managed Kubernetes with vLLM behind a load balancer and Hermes as a separate workload. The same memory directory and the same hermes-3 served model name keep working.

Ready to give your AI agent a sovereign home?

Spin up an RTX 6000 Blackwell and run through this guide yourself. Sign up at leaf.cloud to get started, or book a call with our team if you want to talk through sizing, model selection, or pairing Hermes with other Leafcloud services. We answer.