Large language models (LLMs) are transforming the IT landscape, offering capabilities that were once science fiction. Public models like ChatGPT and Copilot demonstrate the potential of large language models. However, organizations with specific needs, data security concerns, and a desire for control may find private LLMs to be a better option. At Leafcloud, the leading sustainable cloud provider, we recognize the environmental impact of large language models (LLMs) as a growing concern. Training and running these models can consume significant resources. That's why we focus on sustainability by using waste heat from GPUs to power water heating systems. This innovative method allows you to leverage the power of LLMs while significantly reducing your environmental footprint.
In the first part of this series, I will guide you through setting up a large language model in Kubernetes and installing all necessary tools, while in the second part, I will demonstrate how to leverage the capabilities of your LLMs by creating custom chatbots, embedding your documents, and connecting your LLM to VS Code.
Why Consider a Private LLM?
Private LLMs offer several advantages:
- Data Control and Privacy: Keep your sensitive data in-house, ensuring regulatory compliance and mitigating risks of data breaches.
- Cost Efficiency: Predictable costs, potential economies of scale, and long-term savings compared to ongoing external service payments.
- Customization: Fine-tune the model with your proprietary data for enhanced accuracy and relevance to your specific needs. Seamless integration with existing systems using Retrieval-Augmented Generation (RAG) is also possible.
Is Setting Up a Private LLM Right for You?
Setting up a private LLM has become more approachable thanks to a growing range of open-source tools and pre-trained models. While some technical expertise is still required, the process is no longer limited to just large organizations with extensive resources.
If you're intrigued by the potential of private LLMs and have a basic understanding of relevant technologies, don't be discouraged! This guide provides a roadmap to get you started.
The full code of this tutorial can be found here: github
The Setup
Here's a breakdown of the tools and technologies involved
- Managed Kubernetes Cluster with Gardener: Gardener simplifies the creation and management of a Kubernetes cluster, which acts as the underlying platform for running your LLM model.
- Ollama: An open-source tool specifically designed for running large language models efficiently within a Kubernetes environment.
- AnythingLLM: Provides a user interface and functionalities for interacting with your LLM model. It allows you to manage tasks, user access, and potentially fine-tune the model.
- Ingress NGINX: Acts as a traffic director, routing incoming requests to the appropriate components within your LLM setup.
- Certbot: Automates the management of SSL certificates, ensuring secure communication within your private LLM environment.
- chromadb: An open-source database that stores LLM embeddings. It enables you to add documents to your LLM model, allowing you to customize the model's responses based on specific documents or contexts.
This combination of open-source tools provides a secure and customizable foundation for leveraging the power of private large language models.
1. Create a Managed Kubernetes Cluster using Gardener
To create a managed Kubernetes cluster using Gardener, follow these steps:
1. Log in to your Gardener dashboard at https://dashboard.gardener.leaf.cloud
2. Click on the + icon to start creating a new Kubernetes cluster.
Under name give the cluster a suitable name, such as "llm-cluster". Infrastructure and dns
settings can be left with default values.
3. Configure Worker Group:
In this section, select the machine type for your cluster. This decision will impact factors like budget, user load, and the size of the models you plan to run. Refer to the explanation above for considerations when choosing your machine type. For more complex worker group configurations, consult the Gardener documentation.
Example: This tutorial will use Ollama with 2 replicas to run the Llama3 8B model (a relatively small model). A V100 or A30 machine with 2 GPUs is sufficient for this scenario. The image depicts selecting the appropriate machine type in the Gardener dashboard.
Click on the "Create" button to create your cluster. This process can take several minutes depending on the size of the machine you selected. Once the cluster is created, it will be listed in your Gardener dashboard.
For more information on how to create a managed Kubernetes cluster using Gardener and working with gardenctl to manage your kubeconfig files, check the following tutorial: Creating a Kubernetes Cluster.
2. Nvidia Drivers and CUDA Toolkit Installation
To utilize the GPUs with Ollama, we need to install the necessary drivers and CUDA toolkit on our Kubernetes nodes. This can be done using the Nvidia Kubernetes Operator. The operator automatically detects if a node has GPUs and installs the drivers.
For detailed instructions, refer to the Nvidia documentation: Nvidia GPU Operator.
To install the operator, run:
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set toolkit.enabled=true --version 23.6.2
It can take a few minutes before the operator is ready to use. You can check its status by running:
kubectl get pods -n gpu-operator
When all nvidea pods are successfully deployed, we can inspect our node to see if the operator has successfully detected the gpu and installed the drivers and CUDA toolkit.
To do so, run:
kubectl describe nodes
In the label section, you should see labels like this:
nvidia.com/gpu.memory=24576
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-A30
In the resources sections the gpu should be listed.
3. Assigning Public Access
To access your private LLM service and AnythingLLM from outside the Kubernetes cluster (optional), you'll need to create public hostnames. Subdomains act as unique addresses within a larger domain. Leafcloud allows you to create free subdomains under the .domains.leaf.cloud domain.
In this step, we'll create two subdomains:
- ollama.testcompany.domains.leaf.cloud - This will point to your Ollama service.
- anything-llm.testcompany.domains.leaf.cloud - This will point to the AnythingLLM interface.
Creating Subdomains
For detailed instructions on creating subdomains in Leafcloud, refer to this tutorial: Leafcloud Documentation
4. Streamlined Installation with Helmfile
In this step, we'll focus on installing the necessary tools using Helm and Helmfile. Helm is a popular package manager for Kubernetes, and Helmfile helps us streamline the installation process by combining multiple Helm charts (think of them as pre-packaged configurations) into a single file (`helmfile.yaml`). This makes managing the installation and updates of various components much easier.
We'll highlight relevant sections of the helmfile.yaml file to demonstrate how Ollama and AnythingLLM are configured. The full version of this file can be found on our GitHub repository.
See here the installation instructions for helmfile:
Helmfile Configuration
The helmfile.yaml file consists of two main sections:
Repositories
In this example, we've specified public repositories for most tools except AnythingLLM. For AnythingLLM, we decided to use Gimlet's onchart, which is a really nice and easy chart to use for generic applications.
repositories:
- name: nginx-stable
url: https://helm.nginx.com/stable
- name: jetstack
url: https://charts.jetstack.io
Releases
This section defines individual software deployments using Helm charts. Here, we'll see how Ollama and AnythingLLM are configured, including details like chart versions, namespaces, and values files (which can hold specific configuration options).
releases:
- name: ollama
namespace: anything-llm
chart: ollama-helm/ollama
labels:
app: ollama
createNamespace: true
version: "0.29.1"
values: [ "./ollama.yaml" ]
Configuring Ollama and AnythingLLM
While most of the Ollama configuration in the helmfile.yaml file is straightforward, here are some key points to highlight:
Ingress Configuration: For both Ollama and AnythingLLM, you'll need to set up the correct hostname in the ingress section.
Ollama Specific Settings:
- GPU Support: We've enabled GPU support in the Ollama configuration, assuming your cluster has GPUs available. This allows the model to leverage the processing power of GPUs for faster performance.
- Model Selection: The configuration specifies the particular model to be used by Ollama. This model will be downloaded automatically when you first start the Ollama service.
- Persistence: Persistence is enabled to ensure the downloaded model is not lost when pods restart. This means the model is stored on a persistent volume and shared between all Ollama replicas, improving efficiency.
ollama:
gpu:
enabled: true
models: - llama3
persistentVolume:
enabled: true
size: 100Gi
Advanced: Running Multiple Models Concurrently (Optional)
Starting with Ollama version 0.1.33, you can run and respond to queries using multiple large language models simultaneously. This is useful if you have different models for various tasks.
Enabling Multi-Model Support
Add these environment variables to your helmfile.yaml file:
extraEnv:
- name: OLLAMA_NUM_PARALLEL
value: "3"
- name: OLLAMA_MAX_LOADED_MODELS
value: "3"
Authentication
AnythingLLM provides built-in authorization and authentication, enabling you to manage user roles and grant access to the chat environment for interacting with your private LLM. However, when it comes to developer access from your development environment (e.g., using the continue.dev extension in Visual Studio Code), Ollama itself lacks built-in authentication.
To address this temporarily, we’ve exposed Ollama outside the Kubernetes cluster and implemented basic authentication using NGINX ingress. Here’s how we achieved this:
Steps Involved:
Generating Credentials:
- We used the htpasswd command to create a username and password combination, which is stored in a file named "auth."
htpasswd -c auth myuser
Creating a Kubernetes Secret:
- The "auth" file is then converted into a Kubernetes Secret named "basic-auth" using kubectl for secure storage.
kubectl create secret generic basic-auth --from-file=auth
Configuring Ingress Authentication:
- The NGINX ingress configuration for Ollama is modified to include annotations specifying basic authentication with the following details:
annotations:
nginx.ingress.kubernetes.io/auth-type: "basic"
nginx.ingress.kubernetes.io/auth-secret: "basic-auth"
nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
While basic authentication provides a temporary workaround, we are actively seeking a more robust and lightweight solution for securing external access to Ollama.
5 Installing the Private LLM Environment
Once you've configured your helmfile.yaml file (as described in the previous sections), you can install the entire private LLM environment using Helmfile. Here are the commands:
helmfile diff # (Optional) Run this command to see what changes
Helmfile will make before applying them.
helmfile apply # This command will install all the necessary components.
That's it! Let me know what you think and if you have any suggestions send them to hello@leaf.cloud