PhEye Filter - AI-Powered Named Entity Recognition

The PhEye filter provides flexible AI-powered named entity recognition (NER) for detecting persons, organizations, locations, and other custom entities in text. It supports two modes of operation:

  1. Remote Service Mode: Connects to a remote PhEye NLP service via HTTP
  2. Local Model Mode: Uses a local ONNX BERT-based NER model for offline inference

Features

  • Dual Operation Modes: Choose between remote service or local model inference
  • Named Entity Recognition: Detects persons, organizations, locations, and custom entity types
  • BERT Tokenization: Built-in WordPiece tokenizer for BERT models (local mode)
  • ONNX Runtime: Fast local inference using Microsoft's ONNX Runtime
  • Entity Grouping: Automatically groups B- and I- tags into complete entities
  • Confidence Scoring: Provides confidence scores for each detection
  • Configurable Thresholds: Filter entities based on confidence levels per label
  • Bearer Token Authentication: Secure communication with remote PhEye services
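The entity grouping listed above follows the standard BIO scheme: a B- tag opens a new entity and consecutive I- tags of the same type extend it. A minimal sketch of the idea (illustrative only, not the filter's internal implementation):

```csharp
using System.Collections.Generic;

// Sketch of BIO grouping: B-X starts a new entity, I-X extends the current one.
var tokens = new[] { ("John", "B-PER"), ("Smith", "I-PER"), ("works", "O"), ("Microsoft", "B-ORG") };
var entities = new List<(string Text, string Label)>();

foreach (var (word, tag) in tokens)
{
    if (tag.StartsWith("B-"))
        entities.Add((word, tag[2..]));
    else if (tag.StartsWith("I-") && entities.Count > 0)
        entities[^1] = (entities[^1].Text + " " + word, entities[^1].Label);
}

// entities now holds ("John Smith", "PER") and ("Microsoft", "ORG")
```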

Remote Service Mode

Setup

Deploy a PhEye service or use an existing endpoint.

Configuration

using Phileas.Policy;
using Phileas.Policy.Filters;
using Phileas.Services;
using PhileasPolicy = Phileas.Policy.Policy;

var policy = new PhileasPolicy
{
    Name = "pheye-remote-policy",
    Identifiers = new Identifiers
    {
        PhEyes = new List<PhEye>
        {
            new PhEye
            {
                PhEyeConfiguration = new PhEyeConfiguration
                {
                    Endpoint = "http://localhost:8080",
                    BearerToken = "your-api-token",  // Optional
                    Timeout = 30,                     // Seconds
                    Labels = new List<string> { "PERSON", "ORG", "LOC" }
                }
            }
        }
    }
};

JSON Configuration

{
  "name": "pheye-remote-policy",
  "identifiers": {
    "pheye": [
      {
        "phEyeConfiguration": {
          "endpoint": "http://localhost:8080",
          "bearerToken": "your-api-token",
          "timeout": 30,
          "labels": ["PERSON", "ORG", "LOC"]
        },
        "removePunctuation": false
      }
    ]
  }
}

Usage Example

var filterService = new FilterService();

var result = filterService.Filter(
    policy: policy,
    context: "default",
    piece: 0,
    input: "John Smith works at Microsoft in Seattle."
);

Console.WriteLine(result.FilteredText);
// Output: {{{REDACTED-person}}} works at {{{REDACTED-other}}} in {{{REDACTED-location-city}}}.

Local Model Mode

Setup

1. Download the ONNX Model

Download a BERT NER ONNX model from HuggingFace, such as protectai/bert-base-NER-onnx:

# Clone the model repository
git clone https://huggingface.co/protectai/bert-base-NER-onnx

# Or download specific files:
# - model.onnx (the ONNX model file)
# - vocab.txt (the BERT vocabulary file)

2. Configure the Filter

var policy = new PhileasPolicy
{
    Name = "pheye-local-policy",
    Identifiers = new Identifiers
    {
        PhEyes = new List<PhEye>
        {
            new PhEye
            {
                PhEyeConfiguration = new PhEyeConfiguration
                {
                    ModelPath = "C:\\models\\model.onnx",
                    VocabPath = "C:\\models\\vocab.txt",
                    Labels = new List<string> { "PER", "ORG", "LOC", "MISC" }
                }
            }
        }
    }
};

3. JSON Configuration

{
  "name": "pheye-local-policy",
  "identifiers": {
    "pheye": [
      {
        "phEyeConfiguration": {
          "modelPath": "C:\\models\\model.onnx",
          "vocabPath": "C:\\models\\vocab.txt",
          "labels": ["PER", "ORG", "LOC", "MISC"]
        },
        "removePunctuation": false
      }
    ]
  }
}

Mixed Configuration

If both modelPath/vocabPath and endpoint are provided, the filter will prefer the local model. If only modelPath or only vocabPath is set (but not both), the filter falls back to the remote service.

new PhEyeConfiguration
{
    ModelPath = "C:\\models\\model.onnx",
    VocabPath = "C:\\models\\vocab.txt",
    Endpoint = "http://localhost:8080",  // Fallback if local model fails to load
    Labels = new List<string> { "PERSON" }
}

Configuration Options

PhEyeConfiguration Properties

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| Endpoint | string | "http://localhost:8080" | Base URL of the PhEye service (remote mode) |
| BearerToken | string? | null | Bearer token for API authentication (remote mode) |
| Timeout | int | 30 | Request timeout in seconds (remote mode) |
| Labels | List&lt;string&gt; | ["Person"] | Entity labels to detect |
| ModelPath | string? | null | Path to ONNX model file (local mode) |
| VocabPath | string? | null | Path to BERT vocabulary file (local mode) |

PhEye Filter Properties

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| RemovePunctuation | bool | false | Strip punctuation before processing |
| Strategies | List&lt;PhEyeFilterStrategy&gt; | [REDACT] | Replacement strategies for detected entities |
| Ignored | List&lt;string&gt; | [] | Terms to ignore during detection |
| IgnoredPatterns | List&lt;IgnoredPattern&gt; | [] | Regex patterns to ignore |
| Priority | int | 0 | Filter priority for overlapping spans |

Supported Entity Types

The filter maps entity labels to Phileas FilterType enum values:

| Entity Label | FilterType | Description |
|--------------|-----------|-------------|
| PERSON, PER | FilterType.Person | Person names |
| LOCATION, LOC | FilterType.LocationCity | Location names |
| ORGANIZATION, ORG | FilterType.Other | Organization names |
| MISC | FilterType.Other | Miscellaneous entities |

Custom labels not in this list are mapped to FilterType.Other.
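Conceptually, the mapping behaves like the following switch expression (a sketch; the actual helper inside the filter may be named and structured differently):

```csharp
// Illustrative label → FilterType mapping, mirroring the table above.
static FilterType MapLabel(string label) => label.ToUpperInvariant() switch
{
    "PERSON" or "PER"       => FilterType.Person,
    "LOCATION" or "LOC"     => FilterType.LocationCity,
    "ORGANIZATION" or "ORG" => FilterType.Other,
    _                       => FilterType.Other  // MISC and custom labels
};
```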

Confidence Thresholds

You can set minimum confidence thresholds per label to filter out low-confidence predictions:

var thresholds = new Dictionary<string, double>
{
    { "PERSON", 0.90 },
    { "ORG", 0.85 },
    { "LOC", 0.80 }
};

// Note: Thresholds are typically configured via filter strategies
// or custom filter initialization when using the PhEye filter directly

When using filter strategies:

PhEyes = new List<PhEye>
{
    new PhEye
    {
        PhEyeConfiguration = new PhEyeConfiguration
        {
            Endpoint = "http://localhost:8080",
            Labels = new List<string> { "PERSON" }
        },
        Strategies = new List<PhEyeFilterStrategy>
        {
            new PhEyeFilterStrategy
            {
                Strategy = "REDACT",
                Condition = new Condition { Confidence = 0.90 }  // Minimum confidence
            }
        }
    }
}

Multiple PhEye Configurations

You can configure multiple PhEye instances in a single policy, each with different endpoints or model configurations:

PhEyes = new List<PhEye>
{
    new PhEye
    {
        PhEyeConfiguration = new PhEyeConfiguration
        {
            Endpoint = "http://pheye-persons:8080",
            Labels = new List<string> { "PERSON" }
        }
    },
    new PhEye
    {
        PhEyeConfiguration = new PhEyeConfiguration
        {
            Endpoint = "http://pheye-orgs:8080",
            Labels = new List<string> { "ORGANIZATION" }
        }
    }
}

Filter Strategies

The PhEye filter supports all standard Phileas strategies:

using Phileas.Filters;
using Phileas.Policy.Filters.Strategies;

PhEyes = new List<PhEye>
{
    new PhEye
    {
        PhEyeConfiguration = new PhEyeConfiguration
        {
            Endpoint = "http://localhost:8080",
            Labels = new List<string> { "PERSON" }
        },
        Strategies = new List<PhEyeFilterStrategy>
        {
            // Mask person names
            new PhEyeFilterStrategy { Strategy = AbstractFilterStrategy.Mask },

            // Or use static replacement
            new PhEyeFilterStrategy
            {
                Strategy = AbstractFilterStrategy.StaticReplace,
                Replacement = "[NAME REMOVED]"
            }
        }
    }
}

See Filter Strategies for all available options.

Ignored Terms

Configure terms that should not be redacted:

PhEyes = new List<PhEye>
{
    new PhEye
    {
        PhEyeConfiguration = new PhEyeConfiguration
        {
            Endpoint = "http://localhost:8080",
            Labels = new List<string> { "PERSON" }
        },
        Ignored = new List<string> { "John", "Microsoft", "MIT" }
    }
}

Performance Considerations

Remote Service Mode

  • Network Latency: Processing time depends on network speed and service location
  • Scalability: PhEye service can be scaled horizontally
  • Resource Usage: Minimal local resources required
  • Throughput: Depends on service capacity and configuration

Local Model Mode

  • Model Size: BERT-base models are typically ~400MB
  • Memory Usage: Model must be loaded into memory (~400MB RAM)
  • Inference Speed: Processing time depends on text length and CPU/GPU
  • Token Limit: Maximum sequence length is 512 tokens (BERT limit)
  • No Network: Operates completely offline
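Because of the 512-token limit, very long documents should be split before local inference. A conservative word-based splitter (illustrative only; it is not part of the filter API, and WordPiece can emit several tokens per word, hence the generous margin):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative: split text into word windows that stay safely under
// BERT's 512-token limit (WordPiece may produce multiple tokens per word).
static IEnumerable<string> Chunk(string text, int maxWords = 300)
{
    var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < words.Length; i += maxWords)
        yield return string.Join(' ', words.Skip(i).Take(maxWords));
}
```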

Choosing a Mode

| Factor | Remote Service | Local Model |
|--------|---------------|-------------|
| Setup Complexity | Easy | Moderate |
| Network Required | Yes | No |
| Privacy | Data leaves host | Data stays local |
| Scalability | High | Limited by host resources |
| Latency | Variable | Consistent |
| Resource Usage | Low | Moderate-High |

Example Scenarios

Medical Records Processing

var policy = new PhileasPolicy
{
    Name = "medical-ner",
    Identifiers = new Identifiers
    {
        PhEyes = new List<PhEye>
        {
            new PhEye
            {
                PhEyeConfiguration = new PhEyeConfiguration
                {
                    ModelPath = "C:\\models\\medical-ner.onnx",
                    VocabPath = "C:\\models\\vocab.txt",
                    Labels = new List<string> { "PERSON", "CONDITION", "MEDICATION", "PROCEDURE" }
                }
            }
        }
    }
};

var text = "Dr. Sarah Johnson prescribed metformin to treat the patient's diabetes.";
var result = new FilterService().Filter(policy, "ctx", 0, text);

Multi-Language Support

PhEyes = new List<PhEye>
{
    new PhEye
    {
        PhEyeConfiguration = new PhEyeConfiguration
        {
            Endpoint = "http://pheye-english:8080",
            Labels = new List<string> { "PERSON", "ORG", "LOC" }
        }
    },
    new PhEye
    {
        PhEyeConfiguration = new PhEyeConfiguration
        {
            Endpoint = "http://pheye-spanish:8080",
            Labels = new List<string> { "PERSON", "ORG", "LOC" }
        }
    }
}

Troubleshooting

Remote Service Issues

Connection Timeout

  • Verify the endpoint URL is correct and accessible
  • Check network connectivity and firewall rules
  • Increase the Timeout value if the service is slow

Authentication Errors

  • Ensure the BearerToken is correct
  • Verify the token has not expired

No Entities Detected

  • Confirm the Labels list matches the model's output labels
  • Check service logs for errors

Local Model Issues

Model Loading Errors

  • Verify the ModelPath and VocabPath are correct
  • Ensure the ONNX model format is compatible with ONNX Runtime 1.20.1+
  • Check file permissions

OutOfMemoryException

  • The BERT model requires ~400MB RAM minimum
  • Close other applications or increase available memory

Poor Detection Quality

  • Verify the model matches your text domain (general, medical, legal, etc.)
  • Adjust confidence thresholds
  • Consider fine-tuning the model on domain-specific data

Partial Configuration Fallback

  • If only ModelPath or VocabPath is set, the filter uses the remote service
  • Ensure both paths are provided for local inference

Resource Cleanup

The PhEye filter implements IDisposable for proper resource cleanup:

using var filterService = new FilterService();
// Filter operations...
// Resources automatically cleaned up

When manually creating filters:

using var filter = new PhEyeFilter(config, phEyeConfig, false, thresholds);
// Use the filter...
// Automatically disposes ONNX session and HTTP client

Integration with Phileas Pipeline

The PhEye filter integrates seamlessly with other Phileas filters:

var policy = new PhileasPolicy
{
    Name = "comprehensive-pii",
    Identifiers = new Identifiers
    {
        // AI-powered entity detection
        PhEyes = new List<PhEye>
        {
            new PhEye
            {
                PhEyeConfiguration = new PhEyeConfiguration
                {
                    Endpoint = "http://localhost:8080",
                    Labels = new List<string> { "PERSON", "ORG" }
                }
            }
        },

        // Pattern-based detectors
        EmailAddress = new EmailAddress(),
        PhoneNumber = new PhoneNumber(),
        Ssn = new Ssn(),
        CreditCard = new CreditCard()
    }
};

Next Steps

Model Information

  • General NER: protectai/bert-base-NER-onnx
  • Medical NER: Fine-tuned BERT models for medical domain
  • Custom Models: Train and export your own BERT NER models to ONNX format

Model Requirements

  • Format: ONNX
  • Architecture: BERT-based token classification
  • Input tensors: input_ids, attention_mask, token_type_ids
  • Output tensor: logits with shape [batch_size, sequence_length, num_labels]
  • Vocabulary: WordPiece vocabulary file compatible with BERT tokenizer
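You can verify that a candidate model exposes the expected tensors with ONNX Runtime's C# API (the model path below is illustrative):

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

// Inspect the model's tensor names; a compatible model lists
// input_ids, attention_mask, and token_type_ids as inputs and logits as output.
using var session = new InferenceSession(@"C:\models\model.onnx");
Console.WriteLine(string.Join(", ", session.InputMetadata.Keys));
Console.WriteLine(string.Join(", ", session.OutputMetadata.Keys));
```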

Questions?

Visit the Phileas documentation or the GitHub repository for more information.