Monitoring
The proxy exposes a Prometheus metrics endpoint on a dedicated port (default 9090) so it can be firewalled separately from the proxy port. Metrics are enabled by default.
Configuration
metrics:
  enabled: true
  port: 9090
To disable metrics:
metrics:
  enabled: false
Scraping with Prometheus
Add the proxy as a scrape target in your prometheus.yml:
scrape_configs:
  - job_name: philter-ai-proxy
    static_configs:
      - targets: ['your-proxy-host:9090']
Available Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| philter_proxy_requests_total | Counter | provider, status_code, policy | Total requests proxied |
| philter_proxy_request_duration_seconds | Histogram | provider | End-to-end request latency |
| philter_proxy_redaction_duration_seconds | Histogram | provider, policy | Time spent on Philter redaction calls |
| philter_proxy_entities_redacted_total | Counter | entity_type, provider | Total entities redacted |
| philter_proxy_prompt_tokens_total | Counter | provider, model | Total prompt (input) tokens reported by providers |
| philter_proxy_completion_tokens_total | Counter | provider, model | Total completion (output) tokens reported by providers |
| philter_proxy_philter_errors_total | Counter | - | Failed calls to the Philter backend |
| philter_proxy_upstream_errors_total | Counter | provider, status_code | Failed calls to LLM providers |
| philter_proxy_active_requests | Gauge | - | Currently in-flight requests (those holding a concurrency slot) |
| philter_proxy_concurrency_limit | Gauge | scope | Configured max-concurrent-requests ceiling. 0 means unlimited. |
| philter_proxy_concurrency_shed_total | Counter | scope | Requests rejected (HTTP 503) due to the concurrency guard |
Token counters are populated from each provider's native usage response field. They are not incremented for streaming responses, since token counts are not reliably available mid-stream.
Label values
provider: openai, anthropic, gemini, ollama, bedrock, or the name of any configured OpenAI-compatible provider
entity_type: Philter entity type string, e.g. NER_ENTITY, SSN, PHONE_NUMBER, EMAIL_ADDRESS. The full list depends on your Philter policy configuration.
policy: The Philter policy name matched by the route, e.g. default, hipaa-safe-harbor.
scope (on concurrency metrics): global for the proxy-wide cap, per_key for per-API-key caps.
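To check which label values a running proxy actually reports, you can read the metrics endpoint output directly. The following is a minimal sketch that parses Prometheus text-format samples and totals philter_proxy_requests_total by provider. The sample text and its numbers are invented for illustration, and parse_samples and requests_by_provider are hypothetical helper names, not part of the proxy.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; the values are invented for illustration.
SAMPLE = '''\
# HELP philter_proxy_requests_total Total requests proxied
# TYPE philter_proxy_requests_total counter
philter_proxy_requests_total{provider="openai",status_code="200",policy="default"} 120
philter_proxy_requests_total{provider="openai",status_code="502",policy="default"} 3
philter_proxy_requests_total{provider="anthropic",status_code="200",policy="hipaa-safe-harbor"} 40
philter_proxy_active_requests 7
'''

# name, optional {labels}, value, optional integer timestamp
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for each sample line.

    Comment lines (# HELP, # TYPE) are skipped. Escape sequences inside
    label values are left as-is, which is enough for the labels listed above.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def requests_by_provider(text):
    """Sum philter_proxy_requests_total across all other labels."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "philter_proxy_requests_total":
            totals[labels.get("provider", "unknown")] += value
    return dict(totals)


totals = requests_by_provider(SAMPLE)
print(totals)
```

In practice the text would come from an HTTP GET against the metrics port rather than a literal string.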
Health Endpoints
The proxy exposes three HTTP endpoints on the proxy port (not the metrics port) for use as load-balancer health checks and Kubernetes probes.
/livez (liveness)
Always returns 200 OK with body {"status":"ok"} as long as the process is running and the listener is accepting connections. Does not probe Philter - this is the endpoint to point a Kubernetes liveness probe at, so transient upstream blips don't trigger pod restarts.
/readyz (readiness)
Returns 200 OK with body {"status":"ok"} when the proxy is willing to accept traffic, or 503 Service Unavailable with body {"status":"not_ready","reason":"philter_circuit_open"} when the Philter circuit breaker is open AND configured to block. In every other state (no breaker configured, breaker closed, breaker half-open, or breaker open with fallback: passthrough) the proxy is considered ready: individual requests may still fail but Kubernetes should NOT shed traffic from the pod.
Does not probe Philter - the breaker's existing state is the source of truth. This keeps readiness cheap and avoids adding load to a struggling Philter.
Use this as a Kubernetes readiness probe.
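The readiness rule above reduces to a single condition. The sketch below restates it as a function for clarity; it is not the proxy's implementation. The breaker state names ("closed", "open", "half_open") and the fallback value "block" are assumptions made for this example; only fallback: passthrough and the response bodies come from the description above.

```python
import json


def readyz_response(breaker_state, fallback):
    """Return (http_status, body) following the /readyz rules described above.

    breaker_state: None when no breaker is configured, otherwise one of
                   "closed", "open", "half_open" (assumed names).
    fallback:      "block" or "passthrough" ("block" is an assumed name).
    """
    if breaker_state == "open" and fallback == "block":
        body = {"status": "not_ready", "reason": "philter_circuit_open"}
        return 503, json.dumps(body, separators=(",", ":"))
    # Every other combination counts as ready, even if single requests fail.
    return 200, json.dumps({"status": "ok"}, separators=(",", ":"))


for state, fallback in [(None, "block"), ("half_open", "block"),
                        ("open", "passthrough"), ("open", "block")]:
    print(state, fallback, readyz_response(state, fallback))
```

Only the last combination in the loop yields a 503; the first three report ready.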
/health (deprecated)
Retained for backwards compatibility. Returns 200 OK with {"status":"ok","philter":"ok"} when Philter is reachable; 503 with {"status":"degraded","philter":"unreachable"} when not. Unlike /readyz, this endpoint makes an active outbound probe to Philter on every call (2-second timeout).
Deprecated in favor of /livez and /readyz. New deployments should use the split endpoints; treating Philter unreachability as a liveness failure causes Kubernetes to restart healthy pods during transient outages, which is precisely the failure mode the split was introduced to fix.
Distributed Tracing
The proxy emits OpenTelemetry spans for every inbound request, with child spans for each call to Philter and each upstream LLM provider. Trace context is propagated to the upstream via the W3C traceparent header, so a request traversing the proxy can be viewed end-to-end in any APM (Jaeger, Honeycomb, Datadog, Grafana Tempo, etc.).
Tracing is disabled by default. With the SDK off, the proxy pays zero per-request tracing overhead.
Enabling tracing
Two things must be true for spans to start flowing:
- Set tracing.enabled: true in the config (this initialises the OTel SDK).
- Set the standard OTel env vars to point at your collector AND tell the SDK to actually sample. Even with tracing.enabled: true the default sampler is always_off, so spans are only emitted when the operator explicitly opts in via OTEL_TRACES_SAMPLER.
tracing:
  enabled: true
  serviceName: philter-ai-proxy # optional; defaults to "philter-ai-proxy"
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf # or "grpc" for port 4317
export OTEL_TRACES_SAMPLER=parentbased_always_on # see samplers below
Recognised env vars
The proxy honours the standard OTel SDK env vars:
| Env var | Effect |
|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | Collector URL. Required when tracing is enabled. |
| OTEL_EXPORTER_OTLP_PROTOCOL | http/protobuf (default) or grpc. |
| OTEL_EXPORTER_OTLP_HEADERS | Comma-separated key=value headers (e.g., auth tokens). |
| OTEL_EXPORTER_OTLP_INSECURE | true to skip TLS for gRPC exporters. |
| OTEL_SERVICE_NAME | Overrides tracing.serviceName when set. |
| OTEL_RESOURCE_ATTRIBUTES | Extra resource attributes, e.g. deployment.environment=prod. |
| OTEL_TRACES_SAMPLER | always_off (default), always_on, parentbased_always_on, parentbased_always_off, traceidratio, parentbased_traceidratio. |
| OTEL_TRACES_SAMPLER_ARG | Argument for ratio samplers, e.g. 0.1 for 10% sampling. |
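A common reason for "tracing is enabled but no spans arrive" is an env var combination that leaves the sampler off. The sketch below is a hypothetical pre-flight check (check_tracing_env is not part of the proxy) that applies the rules from the table to a dictionary of environment values. Treating an unset OTEL_TRACES_SAMPLER_ARG as acceptable follows the OTel SDK convention that ratio samplers default to 1.0; that is an assumption about how the proxy's SDK behaves.

```python
KNOWN_SAMPLERS = {
    "always_off", "always_on", "parentbased_always_on",
    "parentbased_always_off", "traceidratio", "parentbased_traceidratio",
}
RATIO_SAMPLERS = {"traceidratio", "parentbased_traceidratio"}


def check_tracing_env(env):
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []

    if not env.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is not set")

    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf")
    if protocol not in ("http/protobuf", "grpc"):
        problems.append(f"unsupported OTEL_EXPORTER_OTLP_PROTOCOL: {protocol}")

    # The proxy's default sampler is always_off, per the table above.
    sampler = env.get("OTEL_TRACES_SAMPLER", "always_off")
    if sampler not in KNOWN_SAMPLERS:
        problems.append(f"unrecognised OTEL_TRACES_SAMPLER: {sampler}")
    elif sampler == "always_off":
        problems.append("sampler is always_off, so no spans will be emitted")
    elif sampler in RATIO_SAMPLERS:
        arg = env.get("OTEL_TRACES_SAMPLER_ARG")
        if arg is not None:
            try:
                ratio = float(arg)
            except ValueError:
                problems.append(f"OTEL_TRACES_SAMPLER_ARG is not a number: {arg}")
            else:
                if not 0.0 <= ratio <= 1.0:
                    problems.append("OTEL_TRACES_SAMPLER_ARG must be between 0 and 1")

    return problems


good = check_tracing_env({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4318",
    "OTEL_TRACES_SAMPLER": "parentbased_always_on",
})
print(good)
```

Passing dict(os.environ) instead of a literal dictionary checks the current process environment.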
Spans the proxy emits
| Span | When |
|---|---|
| proxy.request {METHOD} {PATH} | Root span per inbound request, created by otelhttp.NewHandler. Honors an inbound traceparent header. |
| philter.filter | Each call to Philter's /api/explain (inbound redaction + outbound scan). |
| provider.{name} | Each call to an upstream provider (openai, anthropic, gemini, ollama, bedrock, or any configured openaiCompatible name). |
Child spans inherit the inbound trace ID so the whole request appears as one trace in your APM.
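When debugging propagation it helps to know what the traceparent header carries. The sketch below parses the version 00 layout from the W3C Trace Context specification (version, trace ID, parent span ID, flags). It illustrates the header format only; the proxy itself relies on the OTel SDK for this, and parse_traceparent is a hypothetical helper name.

```python
import re

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if the header is invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Bit 0 of the flags byte is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed)
```

The trace_id field is the same 32-character value that appears in every span of the trace, so it is the value to search for in your APM.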
Correlating trace IDs with audit logs
Every audit log entry includes a trace_id field when tracing is active and the request was sampled. Use it to jump from a slow audit-log entry to the full distributed trace in your APM, or vice versa:
{"time":"...","msg":"request","request_id":"...","provider":"openai","http_status":200,"trace_id":"11112222333344445555666677778888",...}
When tracing is disabled or the request was not sampled, trace_id is omitted from the audit entry.
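To go from audit logs to traces in bulk, you can index the log by trace_id. The following is a minimal sketch assuming one JSON object per line, as in the entry shown above. The sample lines, including the timestamps and request IDs, are invented for illustration, and index_by_trace_id is a hypothetical helper name.

```python
import json

# Hypothetical audit-log lines; timestamps and request IDs are invented.
SAMPLE_LOG = [
    '{"time":"2024-01-01T00:00:00Z","msg":"request","request_id":"req-1",'
    '"provider":"openai","http_status":200,'
    '"trace_id":"11112222333344445555666677778888"}',
    '{"time":"2024-01-01T00:00:01Z","msg":"request","request_id":"req-2",'
    '"provider":"anthropic","http_status":200}',
    'not json',
]


def index_by_trace_id(lines):
    """Map trace_id -> list of audit entries, skipping unsampled or malformed lines."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        # trace_id is omitted when tracing is off or the request was not sampled.
        trace_id = entry.get("trace_id")
        if trace_id:
            index.setdefault(trace_id, []).append(entry)
    return index


index = index_by_trace_id(SAMPLE_LOG)
print(sorted(index))
```

Entries without a trace_id are dropped from the index rather than grouped under a placeholder, since they have no trace to link to.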
Grafana Dashboard
A pre-built dashboard covering every metric in the table above is shipped at deploy/grafana/philter-ai-proxy.json. Import it via Grafana → Dashboards → New → Import and pick the Prometheus datasource that's scraping philter_proxy_*. The dashboard exposes a datasource template variable so the same JSON works across environments.
If you'd rather build your own, the recipes below are the queries the bundled dashboard uses.
Recommended panels
Request rate (requests per second by provider):
sum by (provider) (rate(philter_proxy_requests_total[5m]))
Error rate (fraction of requests that returned a 5xx status; multiply by 100 for a percentage):
sum(rate(philter_proxy_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(philter_proxy_requests_total[5m]))
p95 request latency by provider:
histogram_quantile(0.95, sum by (provider, le) (rate(philter_proxy_request_duration_seconds_bucket[5m])))
p95 redaction latency by policy:
histogram_quantile(0.95, sum by (policy, le) (rate(philter_proxy_redaction_duration_seconds_bucket[5m])))
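The p95 values from the two queries above are estimates: histogram_quantile finds the bucket containing the target rank and interpolates linearly inside it. The sketch below reproduces that calculation for classic histograms with positive bucket bounds, which is the usual case for latency. It is a simplified illustration rather than Prometheus's actual code, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) pairs that includes
             the (math.inf, total_count) bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket counts (upper bound in seconds, cumulative observations).
LATENCY_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, LATENCY_BUCKETS)
print(p95)
```

With these counts the 95th observation falls in the 0.5 to 1.0 second bucket, so the estimate is about 0.8125 seconds. The result can never be more precise than the bucket boundaries, which is worth remembering when setting alert thresholds close to a boundary.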
Entities redacted per minute by type:
sum by (entity_type) (rate(philter_proxy_entities_redacted_total[1m])) * 60
Token throughput by provider (tokens per minute):
sum by (provider) (rate(philter_proxy_prompt_tokens_total[5m]) + rate(philter_proxy_completion_tokens_total[5m])) * 60
Prompt vs. completion token split by model:
sum by (model) (rate(philter_proxy_prompt_tokens_total[5m]))
sum by (model) (rate(philter_proxy_completion_tokens_total[5m]))
Cumulative tokens by provider (useful for cost attribution dashboards):
sum by (provider) (philter_proxy_prompt_tokens_total + philter_proxy_completion_tokens_total)
Philter backend error rate:
rate(philter_proxy_philter_errors_total[5m])
Active in-flight requests:
philter_proxy_active_requests
Concurrency
Utilization (fraction of the global concurrency ceiling currently in use; multiply by 100 for a percentage) - only meaningful when listen.maxConcurrentRequests > 0:
philter_proxy_active_requests
/ on() philter_proxy_concurrency_limit{scope="global"}
Sustained shed rate by scope (rejections/sec from the concurrency guard):
sum by (scope) (rate(philter_proxy_concurrency_shed_total[5m]))
If scope="global" is rising, you have a real capacity problem - scale out horizontally first rather than raising the cap. If only scope="per_key" is rising, talk to that tenant or raise their per-key cap; the global pool is fine.
Alerting rules
groups:
  - name: philter-ai-proxy
    rules:
      - alert: PhilterBackendDown
        expr: rate(philter_proxy_philter_errors_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Philter backend is returning errors"
      - alert: HighUpstreamErrorRate
        expr: |
          sum(rate(philter_proxy_upstream_errors_total[5m])) /
          sum(rate(philter_proxy_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of upstream requests are failing"
      - alert: HighRedactionLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(philter_proxy_redaction_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Philter redaction p95 latency exceeds 1 second"
      - alert: ConcurrencyGuardShedding
        expr: rate(philter_proxy_concurrency_shed_total{scope="global"}[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Proxy is shedding requests at the global concurrency cap - scale out or raise listen.maxConcurrentRequests"