Deploying Hermes Agent with a Local LLM on Leafcloud: Sovereign AI on European GPUs
A guide to self-hosting Nous Research's open-source AI agent on a Leafcloud A100 GPU instance — with the model running locally via vLLM, persistent memory on European infrastructure, and zero prompts leaving the EU.
By
Published on
There’s a category of AI agent emerging that’s qualitatively different from the chatbots and coding copilots that came before. They have persistent memory. They run continuously, not just when you open a tab. They write their own skills as they encounter problems, and remember those skills the next time. They live on a server somewhere, and you reach them via Telegram, Discord, Slack, or whatever messaging app you actually use.
Hermes Agent — built by Nous Research, MIT licensed, released earlier this year — is one of the cleanest implementations of this pattern. It’s a Python application that installs in one command, configures via a wizard, runs as a systemd service, and accumulates a model of you and your projects across every conversation.
The cumulative model is exactly the reason it matters where you host this thing. A workflow tool like n8n holds your credentials. An agent like Hermes holds something stranger and more personal: a working memory of your projects, your preferences, your past decisions, your half-finished ideas. Running that on infrastructure you don’t control is a different kind of trust exercise than running a static website.
There’s a second-order question that often gets skipped: even if Hermes itself runs in Europe on infrastructure you trust, every prompt it sends to OpenAI or Anthropic still leaves the continent. The sovereignty story is only half true if the model is hosted elsewhere. The other half lives on the GPU.
This guide deploys the whole stack on Leafcloud: Hermes on an Ubuntu VM, an open-weight Nous model running locally via vLLM on an attached A100, and the messaging gateway connecting your phone to your own private LLM. About forty minutes, end to end. No prompts leave the building.
View GPU pricing and options →
Why run the full Hermes stack on Leafcloud
Sovereignty, properly. A persistent-memory agent calling a US-hosted model still ships every thought you’ve ever had through that model to the US. Local inference on EU infrastructure closes the loop — your data, your inference, your jurisdiction.
The economics of an A100. Renting an A100 by the hour from a Big Tech hyperscaler to run an 8B model wastes most of the silicon. Hermes is a low-throughput workload — one user, intermittent inference — so a single GPU has plenty of headroom for the agent, the model, and a generous KV cache. The same GPU can serve other workloads alongside it.
Heat reuse, not greenwashing. Our compute lives inside real buildings — apartment blocks, offices, swimming pools — where the waste heat from the GPU feeds the building’s heating loop. An A100 under load draws around 400W. Most of that ends up in hot water for the people upstairs instead of being blown into the sky.
Migration from OpenClaw. If you’ve already deployed OpenClaw on Leafcloud, Hermes ships with a tool that imports your memories, skills, command allowlists, API keys, and messaging configs in one command. The transition is painless.
What you’ll end up with
A single Ubuntu 22.04 VM with one A100 attached, running:
- Hermes installed natively under the
ubuntuuser via the official install script - vLLM in Docker, serving an open-weight Nous model on
localhost:8000 - All Hermes state (
~/.hermes/) on a persistent block volume, surviving VM rebuilds - The messaging gateway running as a systemd service, autostarting on boot
- A connection to Telegram (or Discord, Slack, WhatsApp, Signal — your pick) so you can reach the agent from your phone
- Daily backups to Leafcloud object storage covering memory, skills, and OAuth tokens
No public ports beyond SSH. Hermes connects outbound to the messaging platforms. The model only listens on localhost. The agent only ever talks to your own GPU.
1. Provision the GPU VM
In the Leafcloud dashboard, launch a new instance:
| Setting | Value |
|---|---|
| Image | Ubuntu 22.04 LTS |
| Flavor | A100 GPU flavor (1× A100, 80 GB VRAM) |
| Root volume | 100 GB — model weights and Docker layers add up |
| Key pair | Your SSH key |
Security group needs exactly one rule:
22/tcp— SSH, restricted to your IP range
Hermes doesn’t accept inbound connections, and the model only listens on localhost. You talk to your agent through Telegram, not through a public endpoint you have to defend.
Allocate a floating IP and SSH in as ubuntu.
2. Install the NVIDIA driver
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential
sudo apt install -y nvidia-driver-550 nvidia-utils-550
sudo reboot
Reconnect and confirm:
nvidia-smi
You should see the A100 listed with the driver version and CUDA runtime.
3. Install Docker and the NVIDIA Container Toolkit
Docker:
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
newgrp docker
Then the Container Toolkit so containers can see the GPU:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Smoke test:
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
If you see the A100 from inside the container, the plumbing works.
4. Attach a persistent block volume
Two things need to live somewhere persistent: Hermes’s memory directory (~/.hermes/) and the model weights you’re about to download. Both go on a block volume so they survive container, image, and VM rebuilds.
Attach a 200 GB block volume to the instance from the dashboard. It’ll appear as /dev/vdb. Then on the VM:
sudo mkfs.ext4 /dev/vdb
sudo mkdir -p /var/hermes-data
echo '/dev/vdb /var/hermes-data ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount -a
sudo mkdir -p /var/hermes-data/hermes /var/hermes-data/models
sudo chown -R ubuntu:ubuntu /var/hermes-data
ln -s /var/hermes-data/hermes ~/.hermes
The symlink means Hermes still uses its conventional path; the actual data lives on the volume.
5. Run vLLM with an open-weight Nous model
Hermes is built by Nous, and Nous publishes open-weight models tuned for exactly this kind of agentic, tool-calling workload. The sensible defaults:
NousResearch/Hermes-3-Llama-3.1-8Bto get started — fast, fits with room to spare, good for iteration.NousResearch/Hermes-3-Llama-3.1-70Bin AWQ 4-bit quantisation if you want significantly higher quality. Fits comfortably on an 80 GB A100 with KV cache headroom.
Start with the 8B. You can swap to the 70B later by changing one argument.
Launch vLLM in Docker:
docker run -d --restart unless-stopped \
--name vllm \
--gpus all \
--ipc=host \
-p 127.0.0.1:8000:8000 \
-v /var/hermes-data/models:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--served-model-name hermes-3 \
--max-model-len 16384
A few things worth pointing out in that command:
-p 127.0.0.1:8000:8000binds vLLM to localhost only. The model is never reachable from outside the VM.--ipc=hostis required for vLLM’s shared-memory communication.- The HuggingFace cache lives on the persistent volume, so the model survives container rebuilds — no re-downloading 16 GB every time.
First boot downloads the weights and warms up; tail the logs to watch:
docker logs -f vllm
You’re waiting for Uvicorn running on http://0.0.0.0:8000 (it’s bound to localhost via the port mapping despite what the log says).
Test it directly:
curl http://localhost:8000/v1/models
You should see hermes-3 in the response.
6. Install Hermes Agent
The install script handles everything — uv, Python 3.11, Node.js, ripgrep, ffmpeg — no sudo required:
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
source ~/.bashrc
Verify:
hermes --version
hermes doctor
7. Point Hermes at your local model
Run the setup wizard:
hermes setup
When it asks for an LLM provider, choose “Custom OpenAI-compatible endpoint” (or equivalent — wording varies by version) and enter:
- Base URL:
http://localhost:8000/v1 - Model name:
hermes-3 - API key: anything non-empty (vLLM doesn’t check by default; set one if you want via
--api-keyon the vLLM container)
The wizard also walks you through personality, messaging gateway, and OpenClaw migration if you have one to import. Skip what you don’t need now — everything is reconfigurable later via hermes model, hermes gateway setup, etc.
Test the connection on the CLI:
hermes
Type something. The first response will take a moment as vLLM warms up the request path, then subsequent responses should be quick. Ctrl+C to exit.
8. Set up the messaging gateway
If you skipped the gateway during setup:
hermes gateway setup
For Telegram (the most common pick), this walks you through creating a bot via BotFather, pairing your account, and setting the allowed user list. Take your time on the allowed-users step — this is your “only I can talk to my agent” check.
Test in the foreground first:
hermes gateway
Send your bot a message from your phone. It should respond, going round-trip through Telegram → your VM → your A100 → back. Ctrl+C to stop.
9. Install the gateway as a systemd service
Make it permanent:
hermes gateway install
This creates a user systemd unit that starts the gateway on boot, restarts on failure, and logs to the journal:
systemctl --user status hermes-gateway
journalctl --user -u hermes-gateway -f
Enable user-service lingering so it actually starts on boot rather than only when you log in:
sudo loginctl enable-linger ubuntu
Reboot the VM and confirm both the gateway and vLLM come back without manual intervention. Send a Telegram message to check end to end.
10. Back up ~/.hermes to Leafcloud object storage
A daily restic job covering memory, skills, and OAuth tokens. Models live in /var/hermes-data/models and are re-downloadable from HuggingFace, so they don’t need to be in the backup set.
#!/bin/bash
# /usr/local/bin/hermes-backup.sh
set -euo pipefail
export RESTIC_REPOSITORY=s3:https://leafcloud.store/hermes-backups
export RESTIC_PASSWORD_FILE=/etc/restic/password
export AWS_ACCESS_KEY_ID=<object-storage-access-key>
export AWS_SECRET_ACCESS_KEY=<object-storage-secret-key>
restic backup /var/hermes-data/hermes
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
Drop it in cron at 3 AM and you’re done.
11. Keeping the stack up to date
# Update Hermes
hermes update
# Update vLLM
docker pull vllm/vllm-openai:latest
docker stop vllm && docker rm vllm
# (re-run the vLLM launch command from step 5)
To swap models — to the 70B, for example — stop the vLLM container and relaunch with --model NousResearch/Hermes-3-Llama-3.1-70B-AWQ and add --quantization awq. Hermes doesn’t need to know; the served model name stays hermes-3.
What you’ve actually deployed
A persistent-memory AI agent and the model it runs on, both hosted on European infrastructure, both isolated to a single VM with no public attack surface beyond SSH, both backed up to object storage. Your prompts never leave the building. The waste heat from the GPU goes into a real building’s heating loop instead of a hyperscaler’s chiller.
If you outgrow a single A100 — multiple users, larger models, parallel sub-agents pushing throughput — the next steps are either a multi-GPU flavor with a larger model, or Managed Kubernetes with vLLM behind a load balancer and Hermes as a separate workload. The same memory directory and the same hermes-3 served model name keep working.
Ready to give your AI agent a sovereign home?
Spin up an A100 and run through this guide yourself. Sign up at leaf.cloud to get started, or book a call with our team if you want to talk through sizing, model selection, or pairing Hermes with other Leafcloud services. We answer.
related