# Language identification

## Before you begin

Follow the steps in the Install with Helm topic to run F5 AI Gateway.

## Overview

The F5 language identification processor runs in the AI Gateway processors container and predicts the language(s) of a given prompt or response, along with a confidence score, using a pre-trained classification model. The output is a two-letter language code that follows the ISO 639-1 standard. The processor can also detect programming code in text using known code patterns and indicators.

## Processor details

|                               | Supported |
|-------------------------------|-----------|
| Deterministic                 | No        |
| GPU acceleration support      | Yes       |
| Base memory requirement       | 1.12 GB   |
| Input stage                   | Yes       |
| Response stage                | Yes       |
| Recommended position in stage | Beginning |
| Supported language(s)         | See supported languages |

## Configuration

```yaml
processors:
  - name: language-id
    type: external
    config:
      endpoint: https://aigw-processors-f5.ai-gateway.svc.cluster.local
      namespace: f5
      version: 1
    params:
      code_detect: false
      threshold: 0.5
      reject: false
      allowed_languages: []
```

## Parameters

### Common parameters

| Parameter | Description | Type | Required | Default | Examples |
|-----------|-------------|------|----------|---------|----------|
| `code_detect` | Detect programming code in text using known code patterns and indicators. Adds the `code` tag when code is detected. | bool | No | `false` | `true`, `false` |
| `threshold` | Confidence threshold for language detection; any prediction below this value is returned as `unknown` with a confidence of 0.0. Set to 0.0 to disable the threshold. Note: `unknown` predictions are still possible with the threshold disabled. | float (0.0 to 1.0) | No | `0.5` | `0.42` |
| `allowed_languages` | Languages for which the processor allows the request to proceed. All detected languages must be in this list. When not set, all languages are allowed. | list[str] | No | `[]` | `["en", "fr"]` |

> **Note:** The `reject` parameter must be set to `true` to use the `allowed_languages` parameter.
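For example, to allow only English and French traffic and reject everything else, you might combine `reject` with `allowed_languages` as follows (a minimal sketch based on the configuration shown above; adjust the endpoint for your installation):

```yaml
processors:
  - name: language-id
    type: external
    config:
      endpoint: https://aigw-processors-f5.ai-gateway.svc.cluster.local
      namespace: f5
      version: 1
    params:
      reject: true                     # required for allowed_languages to take effect
      allowed_languages: ["en", "fr"]  # all detected languages must appear in this list
```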

## Tags

The detected languages are added to the processor response tags in ISO 639-1 format. If the processor detects programming code, it adds `code` to the tags. If the processor is unable to determine the language, or the confidence value is below the `threshold` value, it adds `unknown` to the tags.

| Tag key | Description | Example values |
|---------|-------------|----------------|
| `language` | Any languages detected by the processor | `["en", "code"]` |

## Supported languages

The language identification processor supports the languages listed in the table below. English (`en`) encompasses both British and American English; Chinese (`zh`) is Simplified Chinese.

| Language  | Code | Language   | Code |
|-----------|------|------------|------|
| Arabic    | ar   | Japanese   | ja   |
| Bulgarian | bg   | Polish     | pl   |
| Chinese   | zh   | Portuguese | pt   |
| Dutch     | nl   | Russian    | ru   |
| English   | en   | Spanish    | es   |
| French    | fr   | Swahili    | sw   |
| German    | de   | Thai       | th   |
| Greek     | el   | Turkish    | tr   |
| Hindi     | hi   | Urdu       | ur   |
| Italian   | it   | Vietnamese | vi   |

## Accuracy

The number of tokens in the input directly affects prediction accuracy. The model relies on contextual clues from neighboring words and sentence structure to capture the semantic relationships that aid classification.

> **Note:** Tokens are the smallest units of text a machine learning model processes. While they often match entire words, models may split words into multiple tokens (e.g., subwords or characters) to better handle rare or complex terms.

| Token count | Approximate accuracy |
|-------------|----------------------|
| 1-5         | 76%                  |
| 6-10        | 96%                  |
| 11+         | 99%                  |

## Code detection

Code detection is deterministic in the language identification processor. It uses a set of regular expression patterns and keywords to detect code in text, but these patterns do not exhaustively cover every programming language, so the processor may not detect all code all of the time.
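Code detection is off by default. To enable it, set `code_detect` in the processor's `params` (a minimal sketch reusing the configuration from above):

```yaml
processors:
  - name: language-id
    type: external
    config:
      endpoint: https://aigw-processors-f5.ai-gateway.svc.cluster.local
      namespace: f5
      version: 1
    params:
      code_detect: true  # add the "code" tag when code patterns are detected
```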

## Chunking input and batch processing

The language identification processor splits inputs and responses into chunks and performs inference on those chunks in batches.

> **Note:** Always perform empirical tests on your hardware with real or representative data. Profiling is the best way to see how changing chunk and/or batch sizes affects performance.

### Chunking input

Chunk size determines how much data from a single input is fed to the model at once. It is bounded by the underlying model's maximum sequence length and shaped by the task's need for context. It directly affects memory usage per inference call and can increase latency if chunks are too large.

The maximum sequence length for the language identification processor is 512 tokens, so the chunk size must not exceed 512. The lowest possible value is 1, but at that size the underlying model cannot reliably classify the input. The default chunk size is 128 tokens, and it should not be set lower than 32. You can override the default by setting the `LANGUAGE_ID_PROCESSOR_CHUNK_SIZE` environment variable (for example, `LANGUAGE_ID_PROCESSOR_CHUNK_SIZE: 256`) in `processors.f5.env`, as in the sketch below.
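A minimal Helm values sketch, assuming the chart exposes `processors.f5.env` as a map of environment variable names to values (see the Install with Helm topic for the exact values layout):

```yaml
# values.yaml (sketch)
processors:
  f5:
    env:
      LANGUAGE_ID_PROCESSOR_CHUNK_SIZE: "256"  # tokens per chunk; 32-512, default 128
```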

### Batch processing

Batch size determines how many separate inputs (or chunks) are processed simultaneously. Larger batch sizes can improve performance by taking advantage of parallel processing, but can also saturate the GPU. The default batch size is 16. There is no upper limit, but the value must be greater than or equal to 1. You can override the default by setting the `LANGUAGE_ID_PROCESSOR_BATCH_SIZE` environment variable in `processors.f5.env`, as in the sketch below.
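The batch size is overridden the same way, under the same assumption about the `processors.f5.env` layout:

```yaml
# values.yaml (sketch)
processors:
  f5:
    env:
      LANGUAGE_ID_PROCESSOR_BATCH_SIZE: "32"  # inputs/chunks per inference batch; >= 1, default 16
```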