Language identification¶
Before you begin¶
Follow the steps in the Install with Helm topic to run F5 AI Gateway.
Overview¶
The F5 language identification processor runs in the AI Gateway processors container and uses a pre-trained classification model to predict, with a confidence score, the language(s) of a given prompt or response. The output is a two-letter language code that follows the ISO 639-1 standard. The language identification processor can also detect programming code in text using known code patterns and indicators.
| Processor details | Supported |
|---|---|
| | No |
| | Yes |
| Base Memory Requirement | 1.12 GB |
| Input stage | Yes |
| Response stage | Yes |
| | Beginning |
| Supported language(s) | See the Supported languages table below |
Configuration¶
processors:
  - name: language-id
    type: external
    config:
      endpoint: https://aigw-processors-f5.ai-gateway.svc.cluster.local
      namespace: f5
      version: 1
    params:
      code_detect: false
      threshold: 0.5
      reject: false
      allowed_languages: []
| Parameters | Description | Type | Required | Defaults | Examples |
|---|---|---|---|---|---|
| code_detect | Detect programming code in text using known code patterns and indicators. | bool | No | | |
| threshold | Confidence threshold for language detection; predictions with a confidence below this value are not returned as a detected language. | float | No | | |
| allowed_languages | List of languages that are allowed for the processor to proceed with the request. All detected languages must be in this list. When not set, all languages are allowed. | list[str] | No | | |
Note
The `reject` parameter must be set to `true` to use the `allowed_languages` parameter.
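For example, a sketch of a processor entry that rejects prompts unless every detected language is English or Spanish, reusing the values from the configuration example above:

```yaml
processors:
  - name: language-id
    type: external
    config:
      endpoint: https://aigw-processors-f5.ai-gateway.svc.cluster.local
      namespace: f5
      version: 1
    params:
      reject: true                     # required for allowed_languages to take effect
      allowed_languages: ["en", "es"]  # ISO 639-1 codes from the supported languages table
```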
Supported languages¶
The language identification processor comes with support for the languages listed in the table below. English (`en`) encompasses both British and American English. Chinese (`zh`) is Simplified Chinese.
| Language | Code | Language | Code |
|---|---|---|---|
| Arabic | ar | Japanese | ja |
| Bulgarian | bg | Polish | pl |
| Chinese | zh | Portuguese | pt |
| Dutch | nl | Russian | ru |
| English | en | Spanish | es |
| French | fr | Swahili | sw |
| German | de | Thai | th |
| Greek | el | Turkish | tr |
| Hindi | hi | Urdu | ur |
| Italian | it | Vietnamese | vi |
Accuracy¶
The number of tokens in the input has a direct impact on the accuracy of the prediction. The model depends on contextual clues from neighboring words and sentence structure to capture the semantic relationships that aid classification.
Note
Tokens are the smallest units of text a machine learning model processes. While they often match entire words, models may split words into multiple tokens (e.g., subwords or characters) for better handling of rare or complex terms.
| Token count | Approximate accuracy |
|---|---|
| 1-5 | 76% |
| 6-10 | 96% |
| 11+ | 99% |
Code Detection¶
Code detection in the language identification processor is deterministic: it uses a set of regular expression patterns and keywords to detect code in text. This set is not exhaustive and does not cover every programming language, so the processor might not detect all code all of the time.
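To turn code detection on, set `code_detect` in the processor's `params`; a minimal sketch of the relevant fragment from the configuration example above:

```yaml
params:
  code_detect: true   # flag prompts and responses that appear to contain programming code
```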
Chunking input and batch processing¶
The language identification processor splits inputs and responses into chunks and performs inference on those chunks in batches.
Note
Always perform empirical tests on hardware with real or representative data. Profiling is the best way to see how changing chunk and/or batch sizes impacts performance.
Chunking input¶
Chunk size indicates how much data from a single input is fed to the model at once. It is driven by the underlying model's maximum sequence length constraint and the task's need for context. It directly impacts memory usage per inference call and can affect latency if chunks are too large.
The language identification processor splits its input into chunks of 32 to 512 tokens (default: 128). This is configurable by setting `LANGUAGE_ID_PROCESSOR_CHUNK_SIZE` in the `processors.f5.env` section of the AI Gateway Helm chart.
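For example, a Helm values override might look like the following sketch, assuming `processors.f5.env` accepts a map of environment variable names to values:

```yaml
# Sketch of an AI Gateway Helm values override (the exact shape of the env block is an assumption)
processors:
  f5:
    env:
      LANGUAGE_ID_PROCESSOR_CHUNK_SIZE: 256   # must be between 32 and 512 tokens
```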
Batch processing¶
Batch size determines how many separate inputs (or chunks) are processed simultaneously. Larger batch sizes can improve performance by taking advantage of parallel processing, but can also saturate the GPU. The default batch size is 16. There is no upper limit, but the value must be greater than or equal to 1. You can override this value by setting the environment variable `LANGUAGE_ID_PROCESSOR_BATCH_SIZE` (for example, `LANGUAGE_ID_PROCESSOR_BATCH_SIZE: 32`) in `processors.f5.env`.
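A corresponding sketch of a Helm values override, under the same assumption about the shape of `processors.f5.env`:

```yaml
processors:
  f5:
    env:
      LANGUAGE_ID_PROCESSOR_BATCH_SIZE: 32   # any integer >= 1; default is 16
```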