In this tutorial we’ll show you how to transcribe audio files to text using OpenAI’s Whisper model and a function written in Python. We’ll start off with GPU acceleration using an NVIDIA GPU and OpenFaaS installed on a K3s cluster, but we’ll also show you how to run the same code using a CPU, which is more commonly available.

Why is audio transcription useful?

Common use-cases for transcribing audio include a bot that summarises customer complaints during a Zoom call, collects negative product feedback from reviews on YouTube, or generates a set of timestamps for YouTube videos, which are later attached via API. You could even take traditional voice or VoIP recordings from a customer service center, and transcribe each one to look for training issues or high-performing telephone agents. If you listen to podcasts on a regular basis and have ever read the show notes, they could have been generated by a transcription model.

A GPU is generally faster than a CPU, but CPU inference can also be very effective if you are able to batch up requests via the OpenFaaS asynchronous invocation system, and collect the results later on. To collect results from async invocations, you can supply a callback URL with the initial request, or have the function store its result in S3. We link to some tutorials in the conclusion that show this approach for other use-cases like PDF generation.
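
As a quick illustration of the async approach, here is a minimal sketch of how a client could queue up an invocation of the whisper function we build later in this tutorial. It assumes the OpenFaaS gateway is port-forwarded to 127.0.0.1:8080 and uses https://example.com/webhook as a placeholder for an endpoint you control, which will receive the transcript:

from urllib import request

# Queue the work via the OpenFaaS async endpoint instead of waiting for the
# transcription to finish. The gateway replies with 202 Accepted and posts the
# result to the X-Callback-Url once the function has completed.
req = request.Request(
    "http://127.0.0.1:8080/async-function/whisper",
    data=b"https://example.com/track.mp3",
    headers={"X-Callback-Url": "https://example.com/webhook"},
    method="POST",
)

with request.urlopen(req) as resp:
    print(resp.status)  # expect 202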

Here’s what we’ll cover:

  • Prepare a K3s cluster with Nvidia GPU support
  • Install OpenFaaS with a GPU Profile
  • Create a Python function to run OpenAI Whisper
  • Make sure the function has a long enough timeout
  • Limit concurrent requests to the function to prevent overloading
  • Run the function with CPU inference, without a GPU

Prepare a K3s cluster with NVIDIA container runtime support

Kubernetes has support for managing GPUs across different nodes using device plugins. The setup in your cluster will depend on your platform and GPU vendor. We will be setting up a k3s cluster with NVIDIA container runtime support.

k3sup is a light-weight CLI utility that lets you quickly set up k3s on any local or remote VM. If you already have a k3s cluster, you can also use k3sup to join an additional agent to your cluster.

You can use our article on how to setup a production-ready Kubernetes cluster with k3s on Akamai cloud computing as an additional reference.

I would suggest setting up the cluster first, and once that is done, SSH into any agent or server with a GPU to prepare the host OS by installing the NVIDIA drivers and container runtime package.

  1. Install the Nvidia drivers, for example: apt install -y cuda-drivers-fabricmanager-515 nvidia-headless-515-server

    This example uses driver version 515 but you should select the appropriate driver version for your hardware.

    Make sure the GPU is detected on the system by running the nvidia-smi command.

     +---------------------------------------------------------------------------------------+
     | NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
     |-----------------------------------------+----------------------+----------------------+
     | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
     |                                         |                      |               MIG M. |
     |=========================================+======================+======================|
     |   0  NVIDIA GeForce GT 1030         On  | 00000000:01:00.0 Off |                  N/A |
     | 35%   19C    P8              N/A /  19W |     92MiB /  2048MiB |      0%      Default |
     |                                         |                      |                  N/A |
     +-----------------------------------------+----------------------+----------------------+
    
  2. Install the Nvidia container runtime packages.

    Add the NVIDIA Container Toolkit package repository by following the instructions at: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt

    Install the NVIDIA container runtime: apt install -y nvidia-container-runtime

  3. Install K3s, or restart it if already installed: curl -ksL get.k3s.io | sh -
  4. Confirm that the nvidia container runtime has been found by k3s: grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml

Once the hosts have been prepared and your cluster is running, apply the NVIDIA runtime class in the cluster:

cat > nvidia-runtime.yaml <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
kubectl apply -f nvidia-runtime.yaml

Install OpenFaaS with a GPU profile

Next install OpenFaaS in your cluster. GPU support is a feature that is only available in the commercial version of OpenFaaS.

Follow the installation instructions in the docs to install OpenFaaS using the official Helm chart.

Add a GPU profile

Function deployments that require a GPU need to have the nvidia runtimeClass set. OpenFaaS uses Profiles to support adding additional Kubernetes-specific configuration to function deployments.

Create a new OpenFaaS Profile to set the runtimeClass:

cat > gpu-profile.yaml <<EOF
kind: Profile
apiVersion: openfaas.com/v1
metadata:
  name: gpu
  namespace: openfaas
spec:
  runtimeClassName: nvidia
EOF
kubectl apply -f gpu-profile.yaml

Profiles can be applied to a function through annotations. To apply the gpu profile to a function, add the annotation com.openfaas.profile: gpu to the function’s configuration.

Create a GPU accelerated function

In this section we will create a function that runs the Whisper speech recognition model to transcribe an audio file.

Every OpenFaaS function is built into an Open Container Initiative (OCI) format container image and published to a container registry. When the function is deployed, a fully qualified image reference is sent to the Kubernetes node, which then pulls down that image and starts a Pod from it for the function.

OpenFaaS supports various languages through its templates concept. The job of a template is to help you create a container image, whilst abstracting away most of the boilerplate code and implementation details.

The Whisper model is available as a Python package. We will be using a slightly adapted version of the python3-http template called python3-http-cuda to scaffold our function. To provide the CUDA Toolkit from NVIDIA, the python3-http-cuda template uses nvidia/cuda instead of Debian as the base image.

Create a new function with the OpenFaaS CLI, then rename its YAML file to stack.yaml. We do this so we don’t need to specify the name using --yaml or -f on every command.

# Change this line to your own registry
export OPENFAAS_PREFIX="ttl.sh/of-whisper"

# Pull the python templates
faas-cli template pull https://github.com/skatolo/python-flask-template

# Scaffold a new function using the python3-http-cuda template
faas-cli new whisper --lang python3-http-cuda

# Rename the function configuration file to stack.yaml
mv whisper.yaml stack.yaml

The function handler whisper/handler.py is where we write our custom code. In this case the function retrieves an audio file from a URL that is passed in through the request body. Next, the Whisper model transcribes the audio file and the transcript is returned in the response.

import tempfile
from urllib.request import urlretrieve

import whisper

def handle(event, context):
    models_cache = '/tmp/models'
    model_size = "tiny.en"

    # The request body contains the URL of the audio file to transcribe.
    url = str(event.body, "UTF-8")

    # Download the audio to a temporary file, which is removed on close.
    audio = tempfile.NamedTemporaryFile(suffix=".mp3", delete=True)
    urlretrieve(url, audio.name)

    # Load the model (cached under /tmp/models) and transcribe the audio.
    model = whisper.load_model(name=model_size, download_root=models_cache)
    result = model.transcribe(audio.name)

    return (result["text"], 200, {'Content-Type': 'text/plain'})

The first time the function is invoked it will download the model and save it to the location set in the models_cache variable, /tmp/models. Subsequent invocations of the function will not need to refetch the model.
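
If you want to avoid that slower first invocation altogether, one option is to load the model at import time rather than inside the handler, so the download happens while the Pod is starting up instead of during the first request. Here is a minimal sketch of that variation, reusing the same model name and cache path as the handler above:

import tempfile
from urllib.request import urlretrieve

import whisper

# Load the model once when the Pod starts; every invocation then reuses it.
MODEL = whisper.load_model(name="tiny.en", download_root="/tmp/models")

def handle(event, context):
    url = str(event.body, "UTF-8")
    audio = tempfile.NamedTemporaryFile(suffix=".mp3", delete=True)
    urlretrieve(url, audio.name)

    result = MODEL.transcribe(audio.name)
    return (result["text"], 200, {'Content-Type': 'text/plain'})

The trade-off is a longer start-up time for the Pod, so adjust any readiness checks accordingly.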

It is good practice to make your function write only to the /tmp folder. That way you can make the function’s file system read-only. OpenFaaS supports this by setting readonly_root_filesystem: true in the stack.yaml file; only the temporary /tmp folder remains writable. This prevents the function from writing to or modifying the rest of the filesystem and provides tighter security for your functions.

Before we can build, deploy and run the function there are a couple of configuration settings that we need to run through.

Add runtime dependencies

Our function handler uses the openai-whisper Python package. Edit the whisper/requirements.txt file and add the following line:

openai-whisper

The whisper package also requires the command-line tool ffmpeg for audio transcoding, so it needs to be installed in the function container. The OpenFaaS python3 templates support installing additional packages with apt through the ADDITIONAL_PACKAGE build argument.

Update the stack.yaml file:

functions:
  whisper:
    lang: python3-http-cuda
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
+    build_args:
+      ADDITIONAL_PACKAGE: "ffmpeg"

Apply profiles

The function will need to use the alternative nvidia runtime class in order to use the GPU. This can be done by applying the OpenFaaS gpu profile created earlier. Add the com.openfaas.profile: gpu annotation to the stack.yaml file:

functions:
  whisper:
    lang: python3-http-cuda
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
+    annotations:
+      com.openfaas.profile: gpu

Configure timeouts

It is common for inference and other machine learning workloads to be long-running jobs. In this example, transcribing the audio file can take some time depending on the size of the file and the speed of the GPU. To ensure the function can run to completion, the timeouts for the function and the OpenFaaS components need to be configured correctly.

For more info see: Expanding timeouts.

functions:
  whisper:
    lang: python3-http-cuda
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
+    environment:
+      write_timeout: 5m5s
+      exec_timeout: 5m

Build and deploy the function

Once the function is configured, you can build, push and deploy it straight to your Kubernetes cluster using the faas-cli:

faas-cli up

Then, invoke the function when ready.

curl -i http://127.0.0.1:8080/function/whisper -d https://example.com/track.mp3

Limit concurrent requests to the function

Depending on the number of GPUs available in your cluster, and the memory available on each GPU, you may want to limit the number of requests that can go to the whisper function at once. Kubernetes doesn’t implement any kind of request limiting for applications, but OpenFaaS can help here.

To prevent overloading the Pod and GPU, we can set a hard limit on the number of concurrent requests the function can handle. This is done by setting the max_inflight environment variable on the function.

For example, if your GPU has enough memory to handle 6 concurrent requests, you can set max_inflight: 6. Any additional requests will be dropped and receive a 429 response. This assumes the producer can buffer the requests and retry them later on. Fortunately, when using async invocations in OpenFaaS, the queue-worker does just that; you can learn how here: How to process your data the resilient way with back pressure

functions:
  whisper:
    lang: python3-http-cuda
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
    environment:
      write_timeout: 5m5s
      exec_timeout: 5m
+      max_inflight: 6
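
If a synchronous caller does hit the limit, it needs to handle the 429 response itself. Below is a minimal sketch of a caller that retries with a simple backoff, assuming the gateway is reachable at 127.0.0.1:8080; for production workloads the async queue-worker approach linked above is the more robust option.

import time
from urllib import request, error

def transcribe(audio_url: str, retries: int = 5) -> str:
    for attempt in range(retries):
        req = request.Request(
            "http://127.0.0.1:8080/function/whisper",
            data=audio_url.encode(),
            method="POST",
        )
        try:
            with request.urlopen(req) as resp:
                return resp.read().decode()
        except error.HTTPError as e:
            if e.code != 429:
                raise
            # The function is at its max_inflight limit, back off and retry.
            time.sleep(2 ** attempt)
    raise RuntimeError("gave up: the function is still at capacity")

print(transcribe("https://example.com/track.mp3"))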

How to run with CPU inference, without a GPU

You can still try out the Whisper inference function even if you don’t have a GPU available, or if you don’t have the commercial version of OpenFaaS. With only a couple of changes the function can run with CPU inference.

The function handler does not need to change. The openai-whisper package automatically detects whether a GPU is available and falls back to the CPU by default.
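
If you prefer to make that choice explicit, or want to log which device is being used, whisper.load_model accepts a device argument. A small sketch of that check, reusing the model name and cache path from the handler:

import torch
import whisper

# The same check whisper performs internally when no device is passed in.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running inference on: {device}")

model = whisper.load_model(name="tiny.en", download_root="/tmp/models", device=device)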

Change the template of the function in the stack.yaml file to python3-http and remove the gpu profile annotation.

whisper:
-   lang: python3-http-cuda
+   lang: python3-http
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
-   annotations:
-     com.openfaas.profile: gpu

Pull the python3-http template.

faas-cli template store pull python3-http

Deploy the function and invoke it with curl as shown in the previous section. The function will now run the inference on the CPU instead. Depending on your hardware this will probably increase the execution time compared to running on a GPU, so make sure to adjust your timeouts as required.

Take it further

Take a look at some other patterns that can be useful for running ML workflows and pipelines with OpenFaaS.

Conclusion

In this tutorial we showed how a K3s cluster can be configured with NVIDIA container runtime support to run GPU-enabled containers. OpenFaaS was installed in the cluster with an additional gpu Profile, which is required to run functions with the alternative nvidia runtimeClass. Using a custom Python template that includes the CUDA Toolkit from NVIDIA, we created a function to transcribe audio files with the OpenAI Whisper model.

We ran through several configuration steps for the function to set appropriate timeouts, and applied the OpenFaaS gpu profile to make the GPU available in the function container. Additionally, we discussed how OpenFaaS features like async invocations and retries can be used together with concurrency limiting to prevent overloading your GPU while still making sure all requests run to completion.

For people who don’t have a GPU available, or who are running the Community Edition of OpenFaaS, we showed how the same function can be deployed to run with CPU inference.

Future work

We showed you how to apply concurrency limiting to make sure the GPU wasn’t overwhelmed with requests; however, Kubernetes has only a very basic way of scheduling Pods to GPUs. The approach taken is to dedicate at least one GPU exclusively to a Pod, so if you wanted the function to scale, you’d need several nodes, each with at least one GPU.

In Kubernetes this is done by passing an additional value to the Pod under the requests/limits section, for example:

resources:
  limits:
    nvidia.com/gpu: 1

We’re looking into the best way to add this for OpenFaaS functions - either directly for each Function Custom Resource, or via a Profile, so feel free to reach out if that’s of interest to you.

Han Verstraete

Associate Software Developer, OpenFaaS Ltd

Co-authored by:

Alex Ellis

Founder of @openfaas.