FAQ

What is the Philter AI Proxy?

The Philter AI Proxy is a proxy for OpenAI, Anthropic (Claude), Google Gemini, Ollama, and Amazon Bedrock that uses Philter to remove PII, PHI, and other sensitive information from LLM requests before they are sent to the provider.

Why should I use it?

By using the proxy, you ensure that sensitive information never leaves your environment and is not sent to the AI providers, helping you maintain compliance and protect privacy. Every request produces a structured audit log for compliance reporting.

Which AI providers are supported?

The proxy supports:

OpenAI
Anthropic (Claude)
Google Gemini
Ollama
Amazon Bedrock (Converse and ConverseStream APIs)
Any OpenAI-compatible provider (Mistral, Cohere, vLLM, LM Studio, etc.) via providers.openaiCompatible

Both streaming and non-streaming requests are supported for all providers.

Does it support streaming?

Yes. Streaming responses (SSE for OpenAI/Anthropic, chunked JSON for Gemini, NDJSON for Ollama) are forwarded to the client in real time without buffering. Inbound prompt redaction works identically for streaming and non-streaming requests.

Outbound response scanning is not applied to streaming responses - the stream is forwarded to the client unchanged and a warning is logged. Outbound scanning applies only to non-streaming responses.

Is any sensitive data logged?

No message content is logged. The audit log contains only metadata (provider, model, entity types, counts, latency, etc.); neither the original nor the filtered text is ever written to it. Audit entries do include client IP addresses, which may be considered personal data under GDPR.

Do I need a Philter instance?

Yes, the proxy requires a running instance of Philter to perform the redaction. You can launch one in your cloud or on-premises. Visit philterd.ai for more information.

Can the proxy scan LLM responses for PII, not just requests?

Yes. Outbound response scanning is supported on an opt-in basis. When enabled, the proxy buffers the LLM's response, passes it through Philter, and returns the result to the client. The behavior when PII is detected is configurable: redact (replace PII tokens), block (return HTTP 403), or flag (pass through with a warning log).

Outbound scanning is disabled by default because it adds a Philter round-trip after the provider responds. Enable it only on routes where compliance requires it. See Configuration for details.
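
The three outbound actions can be pictured as a small dispatch over the buffered response. The following is only a sketch under stated assumptions: apply_outbound_action is a hypothetical helper, the Philter result is assumed to be reduced to the redacted text plus an entity count, and the empty body on block is a placeholder, since this FAQ only specifies the 403 status.

```python
import logging
from typing import Tuple

log = logging.getLogger("outbound-sketch")  # hypothetical logger name


def apply_outbound_action(action: str, original: str, redacted: str,
                          entity_count: int) -> Tuple[int, str]:
    """Return (HTTP status, body) for a buffered, non-streaming response."""
    if entity_count == 0:
        return 200, original      # nothing detected: pass through unchanged
    if action == "redact":
        return 200, redacted      # PII replaced by Philter
    if action == "block":
        return 403, ""            # response withheld; body here is a placeholder
    if action == "flag":
        log.warning("PII detected in response: %d entities", entity_count)
        return 200, original      # unchanged, but a warning is logged
    raise ValueError(f"unknown outbound action: {action}")
```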

Does outbound scanning add latency?

Yes. When outbound scanning is enabled, the proxy must buffer the full provider response and make an additional request to Philter before returning the response to the client. The added latency equals roughly one Philter round-trip (typically low-double-digit milliseconds on local deployments).

Streaming responses are not scanned - they are forwarded immediately - so streaming requests have no outbound latency overhead.

How do I configure the proxy?

The proxy is configured via a YAML configuration file. Please refer to the Configuration page for all available settings.

Can I deploy the proxy to Kubernetes?

Yes. A production-ready Helm chart lives at deploy/helm/philter-ai-proxy/ in the repo, and plain manifests for non-Helm users at deploy/k8s/. The chart supports replicas, autoscaling, Pod Disruption Budgets, optional Ingress, Prometheus Operator ServiceMonitor, mTLS, and TLS issuance via either an existing Secret or cert-manager. A starter Grafana dashboard at deploy/grafana/philter-ai-proxy.json covers every emitted metric. See the Kubernetes Quickstart for the full walkthrough.

Is the proxy open source?

Yes, the Philter AI Proxy is licensed under the Apache License, Version 2.0.

Does the proxy support authentication?

Yes. API key authentication and mTLS are both supported, and both are disabled by default.

For API key authentication, configure one or more keys under auth.apiKeys in the config. Clients send the key in the x-philter-proxy-key header (the header name is configurable). Requests without a valid key receive HTTP 401. Each key can optionally be bound to a specific Philter policy; for example, an admin can issue the healthcare team a key that always uses the HIPAA policy regardless of what the client requests. The proxy's auth header is always stripped before forwarding, so LLM providers never see it.

For zero-trust service-to-service authentication, set listen.clientCA to a CA certificate. The proxy will require and verify a client TLS certificate on every connection. API key auth and mTLS can be used simultaneously. See Configuration for details and examples.
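
From the client side, API key authentication amounts to one extra request header. A minimal sketch, assuming a hypothetical proxy address and an OpenAI-compatible provider registered under the name mistral; only the header name comes from this FAQ, and provider credentials are left out for brevity.

```python
import json
import urllib.request

PROXY_URL = "https://proxy.example.internal"  # hypothetical proxy address
KEY_HEADER = "x-philter-proxy-key"            # default header name from the FAQ


def build_request(proxy_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST that carries the proxy API key."""
    return urllib.request.Request(
        PROXY_URL + "/mistral/v1/chat/completions",  # assumes a provider named "mistral"
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", KEY_HEADER: proxy_key},
        method="POST",
    )
```

Passing the request to urllib.request.urlopen would send it; a missing or invalid key would then surface as a urllib.error.HTTPError with code 401.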

Where can I see throughput and latency numbers for the proxy?

A k6 load-test harness lives at test/load/ in the repo, with a self-contained docker-compose stack (Philter + a stub LLM provider + the proxy) and five scenarios covering inbound redaction, outbound response scanning, streaming, and a no-proxy baseline for comparison. A reference baseline measured on a single-host Intel i5-11400 - including the OpenAI proxy path at ~2,900 req/s p95=8.8ms, and outbound-scan at ~1,400 req/s p95=32ms - is published at Load tests. A scheduled GitHub Actions workflow re-runs the harness weekly and uploads summary JSONs as artifacts.

Does the proxy support OpenTelemetry tracing?

Yes. With tracing.enabled: true in the config, the proxy emits OTLP spans: one root span per inbound request, child spans for each Philter call and each upstream LLM provider call. Trace context is propagated to the upstream via the W3C traceparent header, so end-to-end traces work across the proxy in any APM (Jaeger, Honeycomb, Datadog, Grafana Tempo, etc.). Exporter destination, protocol, headers, and sampler are configured via the standard OTel env vars (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_SAMPLER, etc.).

Tracing is off by default. Even with tracing.enabled: true the default sampler is always_off, so spans only flow when an operator explicitly sets OTEL_TRACES_SAMPLER. The trace_id appears in audit log entries when a request is sampled so APM traces and audit lines can be cross-referenced by ID. See Distributed Tracing for the full reference.
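
To cross-reference a trace with the audit log, the trace ID can be read straight out of a traceparent header. The sketch below follows the W3C Trace Context version 00 layout (version, trace ID, parent ID, flags); the function name is hypothetical and the code is not taken from the proxy.

```python
import re
from typing import Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def trace_id_if_sampled(traceparent: str) -> Optional[str]:
    """Return the trace ID when the header is valid and the sampled flag is set."""
    match = _TRACEPARENT.match(traceparent.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    sampled = int(flags, 16) & 0x01  # lowest flag bit is "sampled"
    return trace_id if sampled else None
```

The returned value is what one would search for in the trace_id field of the audit log.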

What endpoints should I use for Kubernetes probes?

Point liveness at /livez and readiness at /readyz. /livez always returns 200 as long as the process is running, so transient Philter outages don't restart healthy pods. /readyz returns 503 only when the Philter circuit breaker is open with fallback: block - the proxy can't serve any traffic in that state. In every other state (no breaker, breaker closed, half-open, or open with fallback: passthrough) /readyz returns 200. The first-party Helm chart and plain manifests already use these endpoints. The legacy /health endpoint is retained for backwards compatibility but is deprecated; treating Philter unreachability as a liveness failure is the exact failure mode the split fixes. See Monitoring -> Health Endpoints.
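
The probe semantics above reduce to a single condition. A small sketch with hypothetical function names and state strings that mirrors the rules as described here, not the proxy's actual code:

```python
def livez_status() -> int:
    """Liveness: 200 whenever the process is able to answer at all."""
    return 200


def readyz_status(breaker_enabled: bool, breaker_state: str, fallback: str) -> int:
    """Readiness: 503 only when the breaker is open and the fallback is block.

    breaker_state is one of "closed", "half-open", "open".
    fallback is one of "block", "passthrough".
    """
    if breaker_enabled and breaker_state == "open" and fallback == "block":
        return 503
    return 200
```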

Are API keys hashed at rest?

Yes. Keys are hashed when the config is loaded and never held in memory as plaintext. The auth.apiKeys[].key field accepts plaintext (auto-hashed with SHA-256 at load), sha256$<64-hex> for a pre-hashed value, or bcrypt$<bcrypt-hash> for users with existing bcrypt key-management workflows. Verification uses constant-time comparison. For production, pre-hash externally so the plaintext never appears in your YAML, source control, or container images. See API Key Hashing for format details, latency per algorithm, and CLI recipes for generating pre-hashed values.
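
A pre-hashed value in the sha256$<64-hex> format could be produced along these lines. This is a sketch that assumes the digest is an unsalted SHA-256 over the UTF-8 bytes of the key, written as lowercase hex; the API Key Hashing page is the authority on the exact recipe, and both function names are hypothetical.

```python
import hashlib
import hmac


def prehash_sha256(plaintext_key: str) -> str:
    """Return 'sha256$' followed by 64 lowercase hex characters."""
    digest = hashlib.sha256(plaintext_key.encode("utf-8")).hexdigest()
    return "sha256$" + digest


def verify(presented_key: str, stored: str) -> bool:
    """Compare in constant time so timing does not leak matching prefixes."""
    return hmac.compare_digest(prehash_sha256(presented_key), stored)
```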

To keep the secret out of the config file entirely, the key: field also accepts ${ENV_VAR} (read from an environment variable) and file:/path/to/secret (read from a mounted file) references, resolved at load and then hashed like any other value. This is the recommended way to integrate with environment-injected secrets, Kubernetes/Docker secrets, Vault, or AWS Secrets Manager. See Loading secrets from environment variables and files and the key-rotation procedure.
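
The two reference forms can be illustrated with a small resolver. This is a sketch, not the proxy's loader: resolve_key_reference is a hypothetical name, and trimming surrounding whitespace from the file contents is an assumption.

```python
import os
import re
from typing import Mapping

_ENV_REF = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_key_reference(value: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve ${ENV_VAR} and file:/path references; return literals unchanged."""
    match = _ENV_REF.match(value)
    if match:
        return env[match.group(1)]        # raises KeyError if the variable is unset
    if value.startswith("file:"):
        with open(value[len("file:"):], encoding="utf-8") as fh:
            return fh.read().strip()      # assumption: surrounding whitespace is trimmed
    return value                          # a literal or pre-hashed value
```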

Can I cap how many tokens a customer uses (billing quotas)?

Yes. Enable quota to set per-key daily and monthly token caps (prompt + completion). Quotas are distinct from rate limits: rate limits bound request frequency, while quotas bound cumulative token spend. Set a quota.default for all keys and/or per-key overrides on auth.apiKeys[].quota. When a key reaches a window's cap, requests return 429 with a Retry-After pointing at the window reset (UTC midnight for daily, first of next UTC month for monthly). Counters can live in memory (per-replica) or Redis (shared across replicas). See Token Quotas.
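
The window resets translate into a Retry-After value as follows. A worked sketch, assuming the header is expressed in whole seconds (it could equally be an HTTP date) and using a hypothetical function name:

```python
import math
from datetime import datetime, timedelta, timezone


def seconds_until_reset(now: datetime, window: str) -> int:
    """Seconds from now until the next UTC midnight or first of the next UTC month."""
    now = now.astimezone(timezone.utc)
    if window == "daily":
        reset = (now + timedelta(days=1)).replace(
            hour=0, minute=0, second=0, microsecond=0)
    elif window == "monthly":
        if now.month == 12:
            year, month = now.year + 1, 1
        else:
            year, month = now.year, now.month + 1
        reset = datetime(year, month, 1, tzinfo=timezone.utc)
    else:
        raise ValueError(f"unknown window: {window}")
    return max(1, math.ceil((reset - now).total_seconds()))
```

For example, at 12:00 UTC on 28 February 2024 the daily window resets in 43,200 seconds (12 hours) and the monthly window in 129,600 seconds (36 hours, because 2024 is a leap year).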

How do I get per-customer usage for billing?

Enable the admin endpoint and query GET /admin/usage with the configured admin token. It returns per-key current day/month token usage plus lifetime prompt/completion totals as JSON, or CSV with ?format=csv. Keys are identified by their stable opaque ID (key-0, …), never the raw key. Usage is tracked whenever the admin endpoint or quotas are enabled. See Usage Export.

Can I cache responses to repeated prompts?

Yes. Enable cache to serve a stored response for an identical (key, model, request body) combination. A hit skips both Philter and the LLM provider, which cuts cost and latency. Only non-streaming, 2xx POST responses are cached, and the tenant key is part of the cache key so tenants never share cached entries. TTL is configurable; the backend is in-memory (per-replica) or Redis (shared). Responses carry an X-Cache: HIT|MISS header, and philter_proxy_cache_hits_total/_misses_total track the hit rate. See Response Cache.
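
One way to picture tenant-scoped caching is a key derived from all three inputs plus a TTL check on read. The sketch below is illustrative only: the hashing scheme, the cache_key function, and the TTLCache class are assumptions, not the proxy's implementation.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


def cache_key(key_id: str, model: str, body: bytes) -> str:
    """Hash (tenant key ID, model, request body) into one lookup key."""
    h = hashlib.sha256()
    for part in (key_id.encode("utf-8"), model.encode("utf-8"), body):
        h.update(len(part).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
        h.update(part)
    return h.hexdigest()


class TTLCache:
    """In-memory cache whose entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, key: str, value: bytes) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None                    # MISS
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]         # expired: treat as MISS
            return None
        return value                       # HIT
```

Because the tenant key ID feeds the hash, two tenants sending the same body to the same model get different cache keys.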

Does rate limiting work across multiple replicas?

Yes. By default the rate limiter keeps token buckets in process memory, so running N replicas behind a load balancer enforces roughly N× the configured limit (each replica only sees its own traffic). Set rateLimit.backend.type: redis to store the buckets in a shared Redis instance so all replicas enforce one consistent limit. Redis connections support authentication and TLS (including client-cert mTLS), and a configurable failure mode controls behavior when Redis is unreachable: open (the default) falls back to the local in-memory limiter so traffic keeps flowing, while closed rejects requests. See Shared state for multi-replica deployments.
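
The N× effect is easy to reproduce with a toy token bucket. This is a sketch for illustration; the class, the round-robin load balancing, and the numbers are assumptions rather than the proxy's limiter.

```python
class TokenBucket:
    """Classic token bucket: refills at rate_per_sec up to a burst capacity."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def admitted(replicas: int, requests: int, burst: int) -> int:
    """Send all requests at the same instant, round-robin over per-replica buckets."""
    buckets = [TokenBucket(rate_per_sec=1.0, burst=burst) for _ in range(replicas)]
    return sum(buckets[i % replicas].allow(now=0.0) for i in range(requests))
```

With a burst of 10, one replica admits 10 of 100 simultaneous requests, while three replicas with independent buckets admit 30. A shared store removes that multiplication.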

Can I use the proxy with Mistral, Cohere, vLLM, or other OpenAI-compatible providers?

Yes. Register any OpenAI-compatible provider under providers.openaiCompatible in the config, giving it a short name and a target URL. Clients send requests to /{name}/v1/... (e.g., /mistral/v1/chat/completions); the proxy strips the prefix and forwards the standard OpenAI-format request to the configured target after running PII redaction. No changes are needed to route configuration - routes work the same way across all OpenAI-compatible providers. See Configuration for details.
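
The prefix handling can be sketched as a lookup followed by a join. The function name is hypothetical, and treating the configured target as a base URL without the /v1 suffix is an assumption; the Configuration page defines the real semantics.

```python
from typing import Dict, Optional


def resolve_target(path: str, providers: Dict[str, str]) -> Optional[str]:
    """Map /{name}/v1/... to the provider's target URL, or None if unmatched."""
    parts = path.split("/", 2)   # "/mistral/v1/x" -> ["", "mistral", "v1/x"]
    if len(parts) < 3 or parts[0] != "":
        return None
    name, rest = parts[1], parts[2]
    target = providers.get(name)
    if target is None or not rest.startswith("v1/"):
        return None
    # Strip the /{name} prefix and forward the standard OpenAI-style path.
    return target.rstrip("/") + "/" + rest
```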

How does Bedrock authentication work?

The proxy handles AWS Signature Version 4 signing internally. The client sends a plain JSON request (no AWS credentials needed). The proxy signs the modified request using credentials from the standard AWS credential chain - environment variables, EC2 instance profile, ECS task role, or IRSA - before forwarding it to Bedrock. This means you never expose AWS credentials to API clients, and access control is enforced at the IAM level on the proxy's role.

Does the proxy support streaming with Amazon Bedrock?

Yes. The /model/{modelId}/converse-stream endpoint is supported: the inbound request is redacted as usual and the AWS binary event-stream response is forwarded to the client incrementally (no buffering). As with the other providers, the streamed response body is passed through without outbound scanning. Non-streaming requests via the Converse API are also fully supported.

What happens if Philter is temporarily unavailable?

By default, the proxy retries failed Philter calls up to 3 times with exponential backoff before returning an error to the client. Only transient errors (network timeouts, HTTP 5xx responses) are retried; 4xx errors are not.
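
Retrying with exponential backoff can be sketched as below. The base delay, the doubling factor, and the TransientError type are assumptions chosen for illustration; only the limit of 3 retries and the transient-only rule come from the text above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stands in for network timeouts and HTTP 5xx responses."""


def call_with_retries(call: Callable[[], T], max_retries: int = 3,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); on TransientError retry up to max_retries times."""
    attempt = 0
    while True:
        try:
            return call()
        except TransientError:
            if attempt == max_retries:
                raise                             # retries exhausted: surface the error
            sleep(base_delay * (2 ** attempt))    # delays grow 1x, 2x, 4x the base
            attempt += 1
```

Any other exception, such as one representing a 4xx response, propagates immediately without a retry.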

For sustained Philter unavailability, enable the circuit breaker (philter.circuitBreaker.enabled: true). Once the configured failure threshold is reached, the circuit opens and subsequent requests either receive HTTP 503 immediately (fallback: block, the default) or are forwarded unredacted with a warning log (fallback: passthrough). After the configured timeout, the circuit allows a probe request through; if it succeeds, the circuit closes.

See Configuration for retry and circuit breaker settings.
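
The open, probe, and close cycle described above can be modelled as a small state machine. This is a sketch with hypothetical names; counting consecutive failures and allowing exactly one probe at a time are assumptions about details this FAQ does not spell out.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int, open_timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self) -> bool:
        """True if a Philter call may proceed; False means apply the fallback."""
        if self.state == "open" and self.clock() - self.opened_at >= self.open_timeout:
            self.state = "half-open"   # timeout elapsed: let one probe through
            return True
        return self.state == "closed"

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"          # a successful probe closes the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"        # trip, or re-trip after a failed probe
            self.opened_at = self.clock()
```

When allow() returns False, the caller would either answer 503 (fallback: block) or forward the request unredacted with a warning (fallback: passthrough).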

Can I bound how many concurrent requests the proxy will handle?

Yes. Set listen.maxConcurrentRequests to cap the total number of in-flight requests across the whole proxy, and/or auth.apiKeys[].maxConcurrent to cap how many concurrent requests a single API key can hold. Both caps are off by default. When either cap is reached, the proxy returns HTTP 503 with Retry-After: 1 and the JSON body {"error":{"message":"concurrency limit exceeded","type":"capacity"}} instead of queuing the request. The two metrics to watch are philter_proxy_active_requests (current utilization) and philter_proxy_concurrency_shed_total{scope} (rejections by scope). See Configuration for sizing guidance and Monitoring for the PromQL utilization recipe.
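
Shedding instead of queuing corresponds to a non-blocking semaphore acquire. The sketch below uses hypothetical class and function names; the status, header, and body are the ones quoted above.

```python
import json
import threading
from typing import Dict, Optional, Tuple

SHED_BODY = {"error": {"message": "concurrency limit exceeded", "type": "capacity"}}


class ConcurrencyCap:
    """Bounds in-flight requests; callers release a slot when a request ends."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_acquire(self) -> bool:
        return self._slots.acquire(blocking=False)   # never waits in a queue

    def release(self) -> None:
        self._slots.release()


def admit(cap: ConcurrencyCap) -> Optional[Tuple[int, Dict[str, str], str]]:
    """Return None when admitted, otherwise the (status, headers, body) to send."""
    if cap.try_acquire():
        return None
    headers = {"Retry-After": "1", "Content-Type": "application/json"}
    return 503, headers, json.dumps(SHED_BODY, separators=(",", ":"))
```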

What format are the proxy's error responses in?

All errors the proxy generates itself are structured JSON with the shape:

{"error":{"message":"...","type":"...","code":"...","request_id":"..."}}

Every such response sets Content-Type: application/json and an X-Request-Id header carrying the same request_id. The (type, code) pair is a stable enum - codes will not be removed or repurposed across minor versions. The full table lives at Configuration → Error Responses.

To trace a failed request: grab the X-Request-Id header from the response, then search audit logs for request_id=<that value>. The audit entry's error_type and error_code will match what the client received.

Errors that originate from the upstream LLM provider are forwarded through unchanged and follow the provider's own format, not the schema above.
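
The tracing step can be automated on the client side. A minimal sketch with a hypothetical function name: it prefers the X-Request-Id header, falls back to the request_id field of a proxy-generated body, and returns nothing for provider-format errors that carry neither.

```python
import json
from typing import Mapping, Optional


def audit_search_term(headers: Mapping[str, str], body: str) -> Optional[str]:
    """Build the request_id=<value> term to search for in the audit log."""
    request_id = next(
        (v for k, v in headers.items() if k.lower() == "x-request-id"), None)
    if request_id is None:
        try:
            request_id = json.loads(body)["error"]["request_id"]
        except (ValueError, KeyError, TypeError):
            return None   # not a proxy-generated error body
    return f"request_id={request_id}"
```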

Does the proxy track token usage?

Yes. For non-streaming responses, the proxy reads the token usage reported by the provider and includes it in the audit log (prompt_tokens, completion_tokens) and as two Prometheus counters (philter_proxy_prompt_tokens_total, philter_proxy_completion_tokens_total), both labeled by provider and model. These counters can be used to build cost-attribution dashboards in Grafana. Token counts are not available for streaming responses and are omitted from the audit log in that case. See Monitoring for PromQL examples.

Is commercial support available?

Yes, commercial support for the Philter AI Proxy and Philter is available from Philterd. Please contact us for more information.
