As the demand for AI-enabled software continues to surge, there is a rising need for platforms that can quickly provide APIs and other capabilities that are easy to consume by applications. The Open Platform for Enterprise AI (OPEA) project is designed to do just that, specifically for deployment of generative AI (GenAI) applications. In this blog post, we use the OPEA framework to demonstrate the ease of deploying ChatQnA on a laptop running Ubuntu 24.04 LTS and Canonical Kubernetes. This includes:
- A ChatQnA web application that uses the retrieval augmented generation (RAG) architecture, complete with swappable large language model (LLM) components
- A command-line tool for interacting with the ChatQnA service
Install dependencies
sudo apt install curl jq
sudo snap install k8s --classic
sudo snap install helm --classic
Canonical Kubernetes
The OPEA framework can deploy containerized applications through Kubernetes (k8s), allowing users to take full advantage of k8s features such as load balancing and autoscaling. This means applications can be run on a range of environments, from a laptop to a cluster of remote servers.
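As a small illustration of what this enables, the chatqna deployment we create later in this post could be scaled or autoscaled with a single command. This is only a sketch: it assumes a deployment named chatqna (matching the pods shown later) and, for autoscaling, a metrics source such as metrics-server running in the cluster.
sudo k8s kubectl scale deployment chatqna --replicas=2
sudo k8s kubectl autoscale deployment chatqna --min=1 --max=3 --cpu-percent=80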
Canonical Kubernetes is the recommended Kubernetes platform for Ubuntu. It builds upon the upstream distribution of Kubernetes and enriches it with features that ensure the best possible user experience on Ubuntu.
To configure your host system as a Kubernetes node, run the following:
sudo k8s bootstrap
sudo k8s status --wait-ready
mkdir -p ~/.kube
sudo k8s kubectl config view --raw > ~/.kube/config
The final command copies the cluster configuration into ~/.kube/config, ensuring that you are able to add resources to your k8s environment using the helm command.
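To confirm that the node is ready and that helm can reach the cluster through that configuration, you can run a couple of quick checks (the exact output will vary from machine to machine):
sudo k8s kubectl get nodes
helm list --all-namespaces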
Deploy a ChatQnA application with Ollama model server on a laptop
The OPEA framework provides the flexibility to deploy applications with different model sizes across a range of environments, from a PC to a high-performance computing cluster. To showcase this flexibility, we walk through the steps for running a simple ChatQnA web application on a laptop. We will deploy an Ollama model server using helm charts defined in OPEA’s GenAIInfra GitHub repository. Ollama runs language models entirely on the local machine, without relying on cloud services, which makes it ideal for a laptop.
The instructions below should be run on a machine bootstrapped as a k8s node using Canonical Kubernetes as described above.
git clone https://github.com/opea-project/GenAIInfra.git
cd GenAIInfra/helm-charts
./update_dependency.sh
helm dependency update chatqna
export HFTOKEN="XXX"
mkdir -p /tmp/opea-models
export MODELDIR="/tmp/opea-models"
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/cpu-ollama-values.yaml
There are several important notes about these commands:
- The XXX value for HFTOKEN should be replaced with a valid Hugging Face token, which can be obtained by visiting huggingface.co and requesting access to the Meta Llama 3 repository.
- The MODELDIR directory is used to cache models locally so they do not need to be re-downloaded in future deployments. It can be set to any directory you wish, but keep in mind that it must be backed by sufficient disk space to store the models.
- By default the helm chart uses the llama3.2:1b model, which contains 1 billion parameters and is roughly 1.3 GiB in size, making it a good choice for a laptop with limited system memory. It is possible to request a different model (see list here) by passing --set ollama.LLM_MODEL_ID=${MODELNAME} to the final command, where export MODELNAME="XXX" should be run beforehand with the desired model name, as shown in the example after this list.
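For example, a deployment using a different model might look like the following, where llama3.2:3b is purely an illustrative model name; substitute whichever model you prefer from the list:
export MODELNAME="llama3.2:3b"
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set ollama.LLM_MODEL_ID=${MODELNAME} -f chatqna/cpu-ollama-values.yaml
If the chart is already installed, use helm upgrade in place of helm install.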
If successful, the helm install command should show something like the following:
NAME: chatqna
LAST DEPLOYED: Thu Apr 24 10:30:07 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
Allow several minutes for the various k8s components to come up. The exact amount of time will depend on your network speed and whether this is an initial run where the models must be downloaded. The status of the installation process can be checked with the following command:
sudo k8s kubectl get pods
When everything is up, you should see something like the following (in particular, note that every pod shows a “Running” status and is marked “1/1” under READY):
NAME                                       READY   STATUS    RESTARTS   AGE
chatqna-b978f4548-5fvrz                    1/1     Running   0          6m46s
chatqna-chatqna-ui-ffd74c8d8-tb7zw         1/1     Running   0          6m46s
chatqna-data-prep-59849c8885-vk8l7         1/1     Running   0          6m46s
chatqna-nginx-6c855d856c-hzsps             1/1     Running   0          6m46s
chatqna-ollama-857c94585b-wf45g            1/1     Running   0          6m46s
chatqna-redis-vector-db-8566ffdb78-dtrws   1/1     Running   0          6m46s
chatqna-retriever-usvc-57c8c4c7d5-7xh6z    1/1     Running   0          6m46s
chatqna-tei-6bd5c47f74-ttjb7               1/1     Running   0          6m46s
chatqna-teirerank-6cb79c6f6f-r8zqc         1/1     Running   0          6m46s
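If any pod is stuck in a state such as Pending or CrashLoopBackOff, the usual k8s troubleshooting commands apply. The pod name below is just an example copied from the output above; substitute the name of whichever pod you want to inspect:
sudo k8s kubectl describe pod chatqna-ollama-857c94585b-wf45g
sudo k8s kubectl logs chatqna-ollama-857c94585b-wf45g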
The k8s deployment includes a simple web application. To access this application from a web browser, first check the port where the application is running:
sudo k8s kubectl get service chatqna-nginx --output='jsonpath={.spec.ports[0].nodePort}'
Now use the provided port number to open the application in your web browser. For example, if the command above outputs 30098, then navigate to http://localhost:30098, where you will be provided with a simple UI to enter a prompt and interact with the service.
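If you prefer not to copy the port number by hand, you can capture it in a shell variable and open the UI directly (this assumes a desktop session where xdg-open is available):
UI_PORT=$(sudo k8s kubectl get service chatqna-nginx --output='jsonpath={.spec.ports[0].nodePort}')
xdg-open "http://localhost:${UI_PORT}"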
Build a ChatQnA command-line tool
In many cases users may be more interested in developing their own application, or integrating ChatQnA capabilities into an existing application. The services provided by the OPEA framework make this straightforward.
For example, it is possible to query the ChatQnA service endpoints directly using a command-line tool like curl (or any programming language and HTTP library of your choosing). First, expose the application port by binding port 8888 on your laptop to the port of the ChatQnA service within the k8s pod:
sudo k8s kubectl port-forward svc/chatqna 8888:8888
This command should output the following and must be kept active in a separate terminal tab while you are querying the service:
Forwarding from 127.0.0.1:8888 -> 8888
Forwarding from [::1]:8888 -> 8888
Finally, query the endpoint with the following curl command:
curl http://localhost:8888/v1/chatqna -H "Content-Type: application/json" -d '{"messages": "What is a chatbot in about 10 words?"}'
Some typical output may look like the following:
data: b'A'
data: b' chat'
data: b'bot'
data: b' is'
data: b' an'
data: b' artificial'
data: b' intelligence'
data: b' program'
data: b' that'
data: b' uses'
data: b' natural'
data: b' language'
data: b' processing'
data: b' to'
data: b' simulate'
data: b' human'
data: b'-like'
data: b' conversations'
data: b' with'
data: b' users'
data: b' over'
data: b' the'
data: b' internet'
data: b' or'
data: b' messaging'
data: b' platforms'
data: b'.'
data: b''
data: [DONE]
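The service streams tokens back by default. If you would rather receive the whole answer in a single JSON document, the same request can be made with streaming disabled and the response pretty-printed with jq (installed earlier); this is the form the command-line tool below builds on:
curl -s http://localhost:8888/v1/chatqna -H "Content-Type: application/json" -d '{"messages": "What is a chatbot in about 10 words?", "stream": false}' | jq .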
Now that we understand how to interact with the ChatQnA service using curl, we can build a simple ChatQnA command-line tool. Start by creating a bash function that wraps the curl call and accepts a prompt as an argument (this code can be executed directly in your terminal, or you can keep it in your ~/.bashrc for future re-use):
function opea_ask()
{
    if [ "$#" -ne 1 ]; then
        echo "Usage: ${FUNCNAME[0]} \"<prompt>\""
        return 1
    fi
    local prompt=$1
    echo -e "prompt: \"${prompt}\"\n"
    # Build the JSON payload; "stream": false returns the full answer in a single response
    local json_load="{
        \"messages\": \"${prompt}\",
        \"stream\": false
    }"
    # Query the ChatQnA endpoint and extract the answer text; echo -e expands the escaped newlines
    echo -e "$(curl -s http://localhost:8888/v1/chatqna -H "Content-Type: application/json" -d "${json_load}" | jq '.choices[0].message.content')"
}
Now use the command to ask ChatQnA a question:
opea_ask "What are the advantages of running AI applications on Kubernetes?"
Here’s the output:
prompt: "What are the advantages of running AI applications on Kubernetes?"
"Running AI applications on Kubernetes can offer several advantages. One key benefit is scalability and flexibility. Kubernetes allows for easy deployment and scaling of AI workloads, making it an ideal choice for large-scale machine learning models or distributed computing tasks.
Another advantage is the ability to manage and monitor AI applications in a centralized manner. Kubernetes provides features like resource allocation, load balancing, and monitoring, which can help ensure that AI workloads are running efficiently and effectively.
Additionally, Kubernetes enables collaboration and communication among team members by providing a common platform for deploying and managing AI applications. This can facilitate the sharing of knowledge, best practices, and resources across teams.
Furthermore, Kubernetes provides a robust security framework, which can help protect AI applications from potential threats or vulnerabilities. By isolating AI workloads within a secure environment, organizations can reduce the risk of data breaches or other security incidents.
Lastly, Kubernetes supports various containerization technologies like Docker, which allows for efficient deployment and management of AI applications. This enables teams to quickly spin up new instances of their AI models, test new algorithms, or optimize existing ones.
It's worth noting that these advantages are particularly relevant when it comes to large-scale machine learning workloads, distributed computing tasks, or real-time data processing. However, the specific benefits may vary depending on the particular use case and requirements of the organization.
I've found this information in a reliable source from the local knowledge base, which provides insights into the benefits of running AI applications on Kubernetes."
Not bad!
Summary
In this post we provided a small taste of how easily an AI application can be deployed using the OPEA framework on Ubuntu 24.04 LTS and Canonical Kubernetes. You can find more details and resources about OPEA at https://opea.dev/ and on its GitHub page.