Identifier Filter
This filter identifies custom text based on a given regular expression.
The Identifier filter accepts a list of regular expression-based identifiers. See the policy at the bottom of this page for an example.
Note that backslashes in the regular expression will need to be escaped for the policy to be valid JSON.
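The escaping rule can be checked with any JSON library. The sketch below uses Python's standard json module; the pattern and the single-key fragment are illustrative assumptions, not a complete policy.

```python
import json
import re

# The regular expression as you would write it in a regex tester (illustrative).
raw_pattern = r"\b\d{3}-\d{2}-\d{4}\b"

# Hypothetical fragment holding only the "pattern" key; not a full policy.
fragment = {"pattern": raw_pattern}

# Serializing doubles every backslash so the text is valid JSON.
encoded = json.dumps(fragment)
print(encoded)  # {"pattern": "\\b\\d{3}-\\d{2}-\\d{4}\\b"}

# Parsing the JSON restores the original single-backslash pattern.
decoded = json.loads(encoded)["pattern"]
print(decoded == raw_pattern)  # True
print(re.search(decoded, "id 123-45-6789 on file") is not None)  # True
```

Generating the policy with a JSON serializer, rather than typing the escapes by hand, avoids most escaping mistakes.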
Because the pattern is a user-supplied regular expression, each match attempt is time-bounded to guard against catastrophic backtracking (ReDoS). If a pattern exceeds the budget on a given input, matching is aborted and that input yields no matches for the identifier. The budget is controlled by the regex.timeout.ms setting (default 1000 ms).
Required Parameters
This filter has no required parameters.
Optional Parameters
| Parameter | Description | Default Value |
|---|---|---|
| enabled | When set to false, the filter will be disabled and not applied. | true |
| ignored | A list of terms to be ignored by the filter. | None |
| caseSensitive | When set to true, the regular expression will be case sensitive. | true |
| classification | Used to apply an arbitrary label to the identifier, such as "patient-id" or "account-number". | "custom-identifier" |
| pattern | A regular expression for the identifier. Note that backslashes will need to be escaped. | \b[A-Z0-9_-]{4,}\b |
| groupNumber | The regular expression capture group to extract as the identifier (0 is the entire match). Use a capture group when you want to match on surrounding context but redact only part of the match. | 0 |
| validator | An optional named, post-match validator. A match is kept only if the validator passes, so a generic identifier can reject format-valid but invalid values (for example a checksum-invalid number). See Validators below. | None |
| windowSize | Sets the size of the window (in terms) surrounding a span to look for contextual terms. If set, this value overrides the value of span.window.size in the configuration. | The value of span.window.size, which is 5 by default. |
| priority | The priority (integer) of this filter. Valid values are any positive integer, where a higher value indicates a higher priority. Priority is used for tie-breaking when two spans may be otherwise identical. | 0 |
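The effect of groupNumber can be illustrated with an ordinary regular expression. This is a minimal sketch in Python, not the filter's implementation; the "MRN:" pattern, the sample text, and the redact helper are hypothetical.

```python
import re

# Hypothetical pattern: match on the "MRN:" context but capture only the digits.
pattern = re.compile(r"MRN:\s*(\d{6})")
text = "Admitted under MRN: 483920 on Tuesday."

match = pattern.search(text)

# Group 0 is the entire match, surrounding context included.
print(match.group(0))  # MRN: 483920
# Group 1 is only the first capture group.
print(match.group(1))  # 483920


def redact(text, match, group_number, placeholder="{{{REDACTED}}}"):
    """Replace only the span of the selected group with a placeholder."""
    start, end = match.span(group_number)
    return text[:start] + placeholder + text[end:]


print(redact(text, match, 1))  # Admitted under MRN: {{{REDACTED}}} on Tuesday.
```

With group 0 the "MRN:" label would be replaced as well; with group 1 the label survives and only the number is replaced.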
Filter Strategies
The filter may have zero or more filter strategies. When no filter strategy is given, the default strategy of REDACT is
used. When multiple filter strategies are given, they will be applied in the order they are listed.
See Filter Strategies for details.
| Strategy | Description |
|---|---|
| REDACT | Replace the sensitive text with a placeholder. |
| MASK | Replace each character of the sensitive text with a mask character (* by default). |
| TRUNCATE | Replace all but a few characters of the sensitive text with a truncation character (* by default). |
| RANDOM_REPLACE | Replace the sensitive text with a similar, random value. |
| STATIC_REPLACE | Replace the sensitive text with a given value. |
| CRYPTO_REPLACE | Replace the sensitive text with its encrypted value. |
| HASH_SHA256_REPLACE | Replace the sensitive text with its SHA256 hash value. |
| FPE_ENCRYPT_REPLACE | Replace the sensitive text with a value generated by format-preserving encryption (FPE). |
| LAST_4 | Replace the sensitive text with just the last four characters of the text. |
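To make three of the simpler strategies concrete, the sketch below approximates MASK, LAST_4, and HASH_SHA256_REPLACE in Python. The function names are hypothetical. The UTF-8 encoding and hexadecimal output of the hash are assumptions, since this page does not specify them.

```python
import hashlib


def mask(value, mask_char="*"):
    """MASK: replace each character with the mask character."""
    return mask_char * len(value)


def last_4(value):
    """LAST_4: keep just the last four characters."""
    return value[-4:]


def hash_sha256(value):
    """HASH_SHA256_REPLACE: SHA-256 of the text (assumed UTF-8 in, hex out)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


print(mask("AB-1234"))             # *******
print(last_4("4111111111111111"))  # 1111
print(len(hash_sha256("AB-1234"))) # 64
```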
Validators
A regular expression matches a format, but it cannot tell a valid identifier from a value that
merely has the same shape. The optional validator runs a named, built-in check on each match and
keeps the match only if the check passes. This lets a generic identifier reject format-valid but
invalid values without a dedicated filter and without embedding any executable code in the policy.
The validator may be written in either of two forms:
"validator": "luhn"
"validator": { "name": "luhn", "params": { } }
The object form is used by validators that take parameters, for example:
"validator": { "name": "mod11", "params": { "variant": "cpf" } }
For a validator that takes no parameters, such as luhn, the string and object forms are
equivalent. The validator name must be one defined by the redaction policy schema. An unknown
name, or a name the current build does not implement, is a policy error and the policy will fail to
load. A validator is never silently skipped.
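The two forms and the fail-fast rule can be sketched as a small normalization step. This is an illustration in Python, not the policy loader itself; normalize_validator and KNOWN_VALIDATORS are hypothetical names, and the set of names is copied from the validator table on this page.

```python
# Validator names listed on this page; the real list is defined by the policy schema.
KNOWN_VALIDATORS = {
    "luhn", "bic-structural", "de-personalausweis", "de-steuerid",
    "mod11", "mod97", "mod23-letter", "es-cif",
}


def normalize_validator(spec):
    """Turn either validator form into a (name, params) pair, or None if absent."""
    if spec is None:
        return None
    if isinstance(spec, str):
        # String form: "validator": "luhn"
        name, params = spec, {}
    elif isinstance(spec, dict):
        # Object form: "validator": {"name": "...", "params": {...}}
        name, params = spec.get("name"), spec.get("params") or {}
    else:
        raise ValueError("validator must be a string or an object")
    if name not in KNOWN_VALIDATORS:
        # An unknown name is an error; it is never silently skipped.
        raise ValueError(f"unknown validator: {name!r}")
    return name, dict(params)


print(normalize_validator("luhn"))                          # ('luhn', {})
print(normalize_validator({"name": "luhn", "params": {}}))  # ('luhn', {})
print(normalize_validator({"name": "mod11", "params": {"variant": "cpf"}}))
```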
| Validator | Description |
|---|---|
| luhn | Standard mod-10 Luhn checksum over the digits of the match. Separators (spaces, hyphens) are ignored, so a value may be formatted or unformatted. Used by identifiers such as the Canadian SIN, French SIREN, and SIRET. |
| bic-structural | Structural check for a SWIFT/BIC code (ISO 9362), which has no checksum: 4 letters (institution), 2 letters (country), 2 alphanumeric (location), and an optional 3 alphanumeric (branch), for a length of 8 or 11. The country segment must be a valid ISO 3166-1 alpha-2 code. The check is case-insensitive and takes no parameters. |
| de-personalausweis | ICAO 9303 check-digit validation for a German Personalausweis (national ID card) number: a 9-character document number followed by one check digit, for a length of 10. Each document-number character is valued (digits 0-9, letters A-Z map to 10-35), weighted by the repeating 7, 3, 1 pattern, summed, and reduced mod 10 to equal the trailing check digit. The check is case-insensitive and takes no parameters. |
| de-steuerid | Validates a German tax identification number (Steuer-ID / IdNr): 11 digits, first digit non-zero. Applies the structural digit-repetition rule on the first ten digits (exactly one digit repeated, twice or three times, the rest distinct) and the ISO/IEC 7064 MOD 11,10 check digit. Whitespace and . / - separators are ignored; takes no parameters. |
| mod11 | Weighted-sum mod-11 check digit(s). Requires a variant parameter selecting the scheme: cpf (Brazilian CPF, 11 digits) or cnpj (Brazilian CNPJ, 14 digits), each with two check digits. Non-digit separators are ignored; sequences of a single repeated digit are rejected. |
| mod97 | Control derived from the value mod 97. Requires a variant parameter: iban (ISO 13616 / MOD-97-10, value mod 97 equals 1) or nir (French INSEE/NIR, key = 97 - (body mod 97)). For nir, a substitutions parameter maps Corsica department codes to digits (default 2A to 19, 2B to 18). |
| mod23-letter | Control letter taken from a 23-entry table indexed by the number mod 23, for the Spanish DNI (8 digits plus a letter) and NIE (leading X/Y/Z plus 7 digits plus a letter). A substitutions parameter maps the leading NIE letter to a digit (default X to 0, Y to 1, Z to 2). |
| es-cif | Bespoke validator for the Spanish CIF (organization tax ID): a leading organization-type letter, seven digits, and a control character that is a digit or a letter (table JABCDEFGHI) derived from a Luhn-like weighted sum. Takes no parameters. |
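As a rough illustration of three of the checks described in the table, the sketch below implements the mod97 iban variant, the mod23-letter check, and the ICAO 9303 7-3-1 check digit in Python. It is not the built-in implementation. The function names are hypothetical, and the control-letter table is the standard Spanish DNI table, which this page does not spell out.

```python
import re


def iban_mod97(value):
    """ISO 13616: move the first 4 characters to the end, map A-Z to 10-35, test mod 97 == 1."""
    s = re.sub(r"[^0-9A-Za-z]", "", value).upper()
    if len(s) < 5:
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


# Standard DNI control-letter table (assumed; indexed by number mod 23).
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
# Default NIE substitutions as described in the table above.
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}


def mod23_letter(value):
    """Spanish DNI (8 digits + letter) or NIE (X/Y/Z + 7 digits + letter)."""
    s = value.strip().upper()
    if re.fullmatch(r"[XYZ][0-9]{7}[A-Z]", s):
        number = NIE_PREFIX[s[0]] + s[1:8]
    elif re.fullmatch(r"[0-9]{8}[A-Z]", s):
        number = s[:8]
    else:
        return False
    return DNI_LETTERS[int(number) % 23] == s[-1]


def icao_check_digit(document_number):
    """ICAO 9303: weight character values by the repeating 7, 3, 1 pattern, sum, reduce mod 10.

    Accepts digits and letters only, as in the table description.
    """
    weights = (7, 3, 1)
    total = sum(int(ch, 36) * weights[i % 3]
                for i, ch in enumerate(document_number.upper()))
    return total % 10


print(iban_mod97("GB82 WEST 1234 5698 7654 32"))  # True
print(mod23_letter("12345678Z"))                  # True
print(icao_check_digit("L898902C3"))              # 6
```

The sample values are commonly published test values (the example IBAN, the 12345678Z DNI, and the ICAO specimen document number), not real identifiers.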
The luhn validator implements the standard Luhn algorithm only. La Poste SIRETs are a known exception (they are validated by a digit-sum mod 5 rather than Luhn) and will not pass this check.
For example, with "validator": "luhn" a nine-digit pattern keeps 046 454 286 (Luhn-valid) but
drops 123 456 789 (same shape, fails the checksum). Without the validator, both would be redacted.
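The keep-or-drop behavior above can be reproduced with a short sketch of the standard Luhn algorithm. This Python version is illustrative only; luhn_valid is a hypothetical name, and the nine-digit pattern is the one used in the validator example policy on this page.

```python
import re


def luhn_valid(value):
    """Standard mod-10 Luhn check over the ASCII digits; separators are ignored."""
    digits = [int(ch) for ch in value if "0" <= ch <= "9"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Nine digits, optionally separated into groups of three.
pattern = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b")
text = "SIN 046 454 286 on file; reference 123 456 789."

# Keep a match only if the validator passes.
kept = [m.group(0) for m in pattern.finditer(text) if luhn_valid(m.group(0))]
print(kept)  # ['046 454 286']
```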
Conditions
Each filter strategy may have one condition. See Conditions for details.
| Conditional | Description | Operators |
|---|---|---|
| TOKEN | Compares the value of the sensitive text. | ==, != |
| CONTEXT | Compares the filtering context. | ==, != |
| CONFIDENCE | Compares the confidence in the sensitive text against a threshold value. | <, <=, >, >=, ==, != |
Example Policy
{
  "name": "default",
  "identifiers": {
    "identifiers": [
      {
        "pattern": "[A-Z]{9}",
        "caseSensitive": false,
        "classification": "custom-identifier",
        "enabled": true,
        "identifierFilterStrategies": [
          {
            "strategy": "REDACT",
            "redactionFormat": "{{{REDACTED-%t}}}"
          }
        ]
      }
    ]
  }
}
Example Policy with a Validator
This identifier matches a nine-digit Canadian SIN, formatted or unformatted, and keeps only the matches that pass the Luhn checksum.
{
  "name": "default",
  "identifiers": {
    "identifiers": [
      {
        "pattern": "\\b\\d{3}[ -]?\\d{3}[ -]?\\d{3}\\b",
        "caseSensitive": false,
        "classification": "canada-sin",
        "validator": "luhn",
        "enabled": true,
        "identifierFilterStrategies": [
          {
            "strategy": "REDACT",
            "redactionFormat": "{{{REDACTED-%t}}}"
          }
        ]
      }
    ]
  }
}