Skip to content

Identifier

Filter

This filter identifies custom text based on a given regular expression.

The Identifier filter accepts a list of regular expression-based identifiers. See the policy at the bottom of this page for an example.

Note that backslashes in the regular expression will need to be escaped for the policy to be valid JSON.

Because the pattern is a user-supplied regular expression, each match attempt is time-bounded to guard against catastrophic backtracking (ReDoS). If a pattern exceeds the budget on a given input, matching is aborted and that input yields no matches for the identifier. The budget is controlled by the regex.timeout.ms setting (default 1000 ms).

Required Parameters

This filter has no required parameters.

Optional Parameters

Parameter Description Default Value
enabled When set to false, the filter will be disabled and not applied true
ignored A list of terms to be ignored by the filter. None
caseSensitive When set to true, the regular expression will be case sensitive. true
classification Used to apply an arbitrary label to the identifier, such as "patient-id", or "account-number." "custom-identifier"
pattern A regular expression for the identifier. Note that backslashes will need to be escaped. \b[A-Z0-9_-]{4,}\b
groupNumber The regular expression capture group to extract as the identifier (0 is the entire match). Use a capture group when you want to match on surrounding context but redact only part of the match. 0
validator An optional named, post-match validator. A match is kept only if the validator passes, so a generic identifier can reject format-valid but invalid values (for example a checksum-invalid number). See Validators below. None
windowSize Sets the size of the window (in terms) surrounding a span to look for contextual terms. If set, this value overrides the value of span.window.size in the configuration. The value of span.window.size which is by default 5.
priority The priority (integer) of this filter. Valid values are any positive integer, where a higher value indicates a higher priority. Priority is used for tie-breaking when two spans may be otherwise identical. 0

Filter Strategies

The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of REDACT is used. When multiple filter strategies are given the filter strategies will be applied in as they are listed. See Filter Strategies for details.

Strategy Description
REDACT Replace the sensitive text with a placeholder.
MASK Replace each character of the sensitive text with a mask character (* by default).
TRUNCATE Replace all but a few characters of the sensitive text with a truncation character (* by default).
RANDOM_REPLACE Replace the sensitive text with a similar, random value.
STATIC_REPLACE Replace the sensitive text with a given value.
CRYPTO_REPLACE Replace the sensitive text with its encrypted value.
HASH_SHA256_REPLACE Replace the sensitive text with its SHA256 hash value.
FPE_ENCRYPT_REPLACE Replace the sensitive text with a value generated by format-preserving encryption (FPE)
LAST_4 Replace the sensitive text with just the last four characters of the text.

Validators

A regular expression matches a format, but it cannot tell a valid identifier from a value that merely has the same shape. The optional validator runs a named, built-in check on each match and keeps the match only if the check passes. This lets a generic identifier reject format-valid but invalid values without a dedicated filter and without embedding any executable code in the policy.

The validator may be written in either of two forms:

"validator": "luhn"
"validator": { "name": "luhn", "params": { } }

The object form is used by validators that take parameters, for example:

"validator": { "name": "mod11", "params": { "variant": "cpf" } }

For a validator that takes no parameters, such as luhn, the string and object forms are equivalent. The validator name must be one defined by the redaction policy schema. An unknown name, or a name the current build does not implement, is a policy error and the policy will fail to load. A validator is never silently skipped.

Validator Description
luhn Standard mod-10 Luhn checksum over the digits of the match. Separators (spaces, hyphens) are ignored, so a value may be formatted or unformatted. Used by identifiers such as the Canadian SIN, French SIREN, and SIRET.
bic-structural Structural check for a SWIFT/BIC code (ISO 9362), which has no checksum: 4 letters (institution), 2 letters (country), 2 alphanumeric (location), and an optional 3 alphanumeric (branch), for a length of 8 or 11. The country segment must be a valid ISO 3166-1 alpha-2 code. The check is case-insensitive and takes no parameters.
de-personalausweis ICAO 9303 check-digit validation for a German Personalausweis (national ID card) number: a 9-character document number followed by one check digit, for a length of 10. Each document-number character is valued (digits 0-9, letters A-Z map to 10-35), weighted by the repeating 7, 3, 1 pattern, summed, and reduced mod 10 to equal the trailing check digit. The check is case-insensitive and takes no parameters.
de-steuerid Validates a German tax identification number (Steuer-ID / IdNr): 11 digits, first digit non-zero. Applies the structural digit-repetition rule on the first ten digits (exactly one digit repeated, twice or three times, the rest distinct) and the ISO/IEC 7064 MOD 11,10 check digit. Whitespace and . / - separators are ignored; takes no parameters.
mod11 Weighted-sum mod-11 check digit(s). Requires a variant parameter selecting the scheme: cpf (Brazilian CPF, 11 digits) or cnpj (Brazilian CNPJ, 14 digits), each with two check digits. Non-digit separators are ignored; sequences of a single repeated digit are rejected.
mod97 Control derived from the value mod 97. Requires a variant parameter: iban (ISO 13616 / MOD-97-10, value mod 97 equals 1) or nir (French INSEE/NIR, key = 97 - (body mod 97)). For nir, a substitutions parameter maps Corsica department codes to digits (default 2A to 19, 2B to 18).
mod23-letter Control letter taken from a 23-entry table indexed by the number mod 23, for the Spanish DNI (8 digits plus a letter) and NIE (leading X/Y/Z plus 7 digits plus a letter). A substitutions parameter maps the leading NIE letter to a digit (default X to 0, Y to 1, Z to 2).
es-cif Bespoke validator for the Spanish CIF (organization tax ID): a leading organization-type letter, seven digits, and a control character that is a digit or a letter (table JABCDEFGHI) derived from a Luhn-like weighted sum. Takes no parameters.

The luhn validator implements the standard Luhn algorithm only. La Poste SIRETs are a known exception (they are validated by a digit-sum mod 5 rather than Luhn) and will not pass this check.

For example, with "validator": "luhn" a nine-digit pattern keeps 046 454 286 (Luhn-valid) but drops 123 456 789 (same shape, fails the checksum). Without the validator, both would be redacted.

Conditions

Each filter strategy may have one condition. See Conditions for details.

Conditional Description Operators
TOKEN Compares the value of the sensitive text. == , !=
CONTEXT Compares the filtering context. == , !=
CONFIDENCE Compares the confidence in the sensitive text against a threshold value. < , <=, > , >=, ==, !=

Example Policy

{
  "name": "default",
  "identifiers": {
    "identifiers": [
      {
        "pattern": "[A-Z]{9}",
        "caseSensitive": false,
        "classification": "custom-identifier",
        "enabled": true,
        "identifierFilterStrategies": [
          {
            "strategy": "REDACT",
            "redactionFormat": "{{{REDACTED-%t}}}"
          }
        ]        
      }
    ]
  }
}

Example Policy with a Validator

This identifier matches a nine-digit Canadian SIN, formatted or unformatted, and keeps only the matches that pass the Luhn checksum.

{
  "name": "default",
  "identifiers": {
    "identifiers": [
      {
        "pattern": "\\b\\d{3}[ -]?\\d{3}[ -]?\\d{3}\\b",
        "caseSensitive": false,
        "classification": "canada-sin",
        "validator": "luhn",
        "enabled": true,
        "identifierFilterStrategies": [
          {
            "strategy": "REDACT",
            "redactionFormat": "{{{REDACTED-%t}}}"
          }
        ]
      }
    ]
  }
}