GPU support
Overview
The language identification and prompt injection processors can use a Graphics Processing Unit (GPU) to improve performance.
Requirements
One or more nodes in your Kubernetes cluster that have access to CUDA-compatible NVIDIA GPU(s) and are configured for Kubernetes GPU scheduling.
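A quick way to confirm that GPU scheduling is working is to run a short-lived pod that requests a GPU and prints the device list. This is a minimal sketch using the standard `nvidia.com/gpu` resource exposed by the NVIDIA device plugin; the pod name and image tag are examples, so choose an image that matches your driver version.

```yaml
# gpu-scheduling-test.yaml -- throwaway pod to verify GPU scheduling
apiVersion: v1
kind: Pod
metadata:
  name: gpu-scheduling-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # ships nvidia-smi
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # only schedulable on a node with a free GPU
```

If the pod runs and `nvidia-smi` lists a device in its logs, the cluster meets this requirement.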
Enabling GPU support for processors
In your Helm values:

- Set `processors.f5.gpu.enabled` to `true`.
- Add `"nvidia.com/gpu": 1` to `processors.f5.resources.limits`.
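Taken together, a minimal `values.yaml` sketch might look like the following. The key nesting is inferred from the paths above; merge it with the rest of your existing values.

```yaml
# values.yaml (excerpt) -- enable GPU support for the f5 processors
processors:
  f5:
    gpu:
      enabled: true
    resources:
      limits:
        "nvidia.com/gpu": 1  # request one GPU from the Kubernetes scheduler
```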
You can verify that the processors are using the GPU by checking the processor logs for the message `CUDA compatible GPU(s) detected`.
You can deactivate GPU support for an individual processor by setting `LANGUAGE_ID_PROCESSOR_ENABLE_GPU: "false"` or `PROMPT_INJECTION_PROCESSOR_ENABLE_GPU: "false"` in `processors.f5.env`.
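For example, to run the language identification processor on CPU while the prompt injection processor continues to use the GPU, here is a sketch assuming `processors.f5.env` takes a plain key/value map (check your chart's values schema if it uses a list of `name`/`value` pairs instead):

```yaml
# values.yaml (excerpt) -- GPU enabled overall, but disabled for one processor
processors:
  f5:
    gpu:
      enabled: true
    env:
      # Only the language identification processor falls back to CPU.
      LANGUAGE_ID_PROCESSOR_ENABLE_GPU: "false"
```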
CUDA Support
The language identification and prompt injection processors have been tested with CUDA 12.4 through 12.6 on the amd64 (x86_64) architecture running Linux.
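If your cluster mixes architectures or operating systems, you can keep the processors on nodes matching the tested platform using the standard Kubernetes node labels. The sketch below assumes the chart exposes a `processors.f5.nodeSelector` value; this key is illustrative, so confirm it against your chart's values schema.

```yaml
# values.yaml (excerpt) -- pin processors to the tested platform
processors:
  f5:
    nodeSelector:
      kubernetes.io/arch: amd64  # tested architecture
      kubernetes.io/os: linux    # tested operating system
```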
Memory Requirements
Processors with GPU support list their base memory requirements in their ‘Processor Details’ table.
Memory requirements increase under the following conditions:

- During processing of requests
- Larger chunk size configurations
- Larger batch size configurations
The framework used to run machine learning models requires some additional overhead and may reserve more memory than is strictly necessary for the model itself; however, this overhead is usually small compared to the memory footprint of the model. In memory-constrained scenarios, keep in mind that this overhead, plus the memory required during inference, will push total memory usage above the combined size of the base model and its associated tokenizer.
Note
Inference is the process of asking a model to perform the task for which it was trained, such as text classification.
Always perform empirical tests on hardware with real or representative data to determine your environment’s complete memory requirements.
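As a starting point for such testing, you can set memory requests and limits above the base figure from the ‘Processor Details’ table to leave headroom for inference and framework overhead. The numbers below are placeholders, not recommendations; replace them with values measured in your environment.

```yaml
# values.yaml (excerpt) -- placeholder memory sizing with headroom
processors:
  f5:
    resources:
      requests:
        memory: "4Gi"  # placeholder: roughly the base model + tokenizer footprint
      limits:
        memory: "6Gi"  # placeholder: headroom for inference and framework overhead
        "nvidia.com/gpu": 1
```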