GPU support

Overview

The language identification and prompt injection processors can use a Graphics Processing Unit (GPU) to improve performance.

Requirements

Your Kubernetes cluster must have one or more nodes with access to CUDA-compatible NVIDIA GPU(s) that are configured for Kubernetes GPU scheduling.

Enabling GPU support for processors

In your Helm values (a complete example follows this list):

  • Set processors.f5.gpu.enabled to true

  • Add "nvidia.com/gpu": 1 to processors.f5.resources.limits

You can verify that the processors are using the GPU by checking the processor logs for the message CUDA compatible GPU(s) detected.

You can deactivate GPU support for an individual processor by setting LANGUAGE_ID_PROCESSOR_ENABLE_GPU: "false" or PROMPT_INJECTION_PROCESSOR_ENABLE_GPU: "false" in processors.f5.env.
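
For example, the following sketch deactivates the GPU for both processors. It assumes processors.f5.env accepts a map of environment variable names to values, as the paths above suggest:

    processors:
      f5:
        env:
          LANGUAGE_ID_PROCESSOR_ENABLE_GPU: "false"       # language identification runs without the GPU
          PROMPT_INJECTION_PROCESSOR_ENABLE_GPU: "false"  # prompt injection runs without the GPU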

CUDA Support

The language identification and prompt injection processors have been tested with CUDA 12.4 through 12.6 on the amd64 (x86_64) architecture running Linux.

Memory Requirements

Each processor with GPU support lists its base memory requirements in its ‘Processor Details’ table.

Memory requirements increase under the following conditions:

  • While requests are being processed

  • When larger chunk sizes are configured

  • When larger batch sizes are configured

The framework used to run the machine learning models adds some overhead of its own and may reserve more memory than the model strictly needs; however, this overhead is usually small compared to the memory footprint of the model.

In memory-constrained environments, keep in mind that this overhead, together with the memory required during inference, pushes total memory usage above the combined size of the base model and its associated tokenizer.

Note

Inference is the process of asking a model to perform the task for which it was trained, such as text classification.

Always perform empirical tests on hardware with real or representative data to determine your environment’s complete memory requirements.
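
Once you have measured actual usage, you can codify the result in the chart's resource limits. The following is a hypothetical sketch that reuses the processors.f5.resources.limits path shown earlier; the memory figure is illustrative only, not a recommendation:

    processors:
      f5:
        resources:
          limits:
            memory: "4Gi"          # hypothetical: measured model + tokenizer footprint plus inference and framework headroom
            "nvidia.com/gpu": 1    # as configured when enabling GPU support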