Configuration

The proxy is configured through a YAML file. The file is required and must be supplied with the --config flag or the PHILTER_PROXY_CONFIG environment variable.

./philter-ai-proxy --config config.yaml
# or
PHILTER_PROXY_CONFIG=config.yaml ./philter-ai-proxy

Example Configuration

listen:
  port: 8080
  cert: cert.pem
  key: key.pem
  shutdownTimeout: 30

logging:
  enabled: true
  # file: /var/log/philter-ai-proxy/audit.log

philter:
  endpoint: https://philter.internal:8080
  tlsVerify: true
  # caCert: /etc/ssl/internal-ca.pem
  retry:
    maxAttempts: 3
    initialBackoffMs: 100
    maxBackoffMs: 2000
  # circuitBreaker:
  #   enabled: true
  #   threshold: 5
  #   timeoutSeconds: 30
  #   fallback: block

providers:
  openai:
    target: https://api.openai.com
    # tlsVerify: true
  anthropic:
    target: https://api.anthropic.com
    # tlsVerify: true
  gemini:
    target: https://generativelanguage.googleapis.com
    # tlsVerify: true
  ollama:
    target: http://localhost:11434
    # tlsVerify: true

routes:
  - match:
      header: x-philter-policy
      value: hipaa
    policy: hipaa-safe-harbor
    context: healthcare-chatbot

  - match:
      path: /v1/chat/completions
      model: gpt-4
    policy: general-purpose
    context: internal-analytics

  - match:
      model: claude-sonnet-4-20250514
    policy: code-review-policy

defaults:
  policy: default
  context: none

Configuration Compatibility

The configuration file carries an optional top-level schema version:

version: 1   # optional; defaults to the current schema when omitted

Backward-compatibility policy (the config-schema counterpart of the stable error-code contract):

Additive changes ship in any release. New optional fields with safe defaults may be added at any time. A config that is valid for version N keeps working on later releases of the same major version without edits.
version is optional and defaults to the current schema. Omitting it is fully supported, so existing configs need no changes. Setting it explicitly (version: 1) lets you pin the schema your automation was written against and get a clear startup error if a future build no longer supports it.
No silent breaking changes. Existing fields will not be removed, renamed, or have their meaning/defaults changed in a way that breaks a valid config across minor versions. Anything breaking is reserved for a major-version bump.
Unsupported version → clear startup failure. If version is set to a value this build does not understand, the proxy exits at startup (and --validate-config returns non-zero) with config: unsupported config version <n> (this build supports version <m>) ... — it never silently ignores the field.

Migration guidance. When a breaking schema change is unavoidable, the schema version is incremented, the release notes document the field-by-field migration, and both the old and new versions are accepted for at least one minor release so deployments can migrate without downtime. Validate a config against the running build before rollout with:

./philter-ai-proxy --validate-config --config config.yaml

The current schema version is 1.
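The unsupported-version rule above can be pictured as a small startup check. This is a minimal sketch, not the proxy's code: `check_config_version` and `SUPPORTED_VERSION` are hypothetical names, and the error text reproduces only the prefix quoted above.

```python
SUPPORTED_VERSION = 1  # current schema version stated in this document


def check_config_version(config):
    """Return the effective schema version, or raise on an unsupported one."""
    version = config.get("version")
    if version is None:
        # Field omitted: track the current schema.
        return SUPPORTED_VERSION
    if version != SUPPORTED_VERSION:
        # Never silently ignore a version this build does not understand.
        raise ValueError(
            f"config: unsupported config version {version} "
            f"(this build supports version {SUPPORTED_VERSION})"
        )
    return version
```

A config with no `version` key and one with `version: 1` both pass; `version: 2` fails fast.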

Configuration Reference

`version`

Field	Type	Default	Description
`version`	int	current schema (`1`)	Optional config schema version. Omit to track the current schema, or pin it (e.g. `1`) so an unsupported future schema fails fast at startup. See Configuration Compatibility.

`listen`

Field	Type	Default	Description
`port`	int	`8080`	Port the proxy listens on
`cert`	string	`cert.pem`	Path to the TLS certificate file
`key`	string	`key.pem`	Path to the TLS private key file
`shutdownTimeout`	int	`30`	Seconds to wait for in-flight requests during graceful shutdown
`clientCA`	string	(none)	Path to a PEM CA certificate used to verify client certificates. When set, mTLS is enabled and the proxy requires a valid client certificate on every connection. See mTLS below.
`maxConcurrentRequests`	int	`0` (unlimited)	Maximum number of in-flight requests the proxy will process at once. Excess requests get HTTP 503 with `Retry-After: 1`. See Concurrency Limits below.
`maxRequestBodyBytes`	int	`10485760` (10 MiB)	Maximum inbound request body size in bytes. Larger bodies are rejected with HTTP 413. See Request Hardening below.
`maxHeaderBytes`	int	`1048576` (1 MiB)	Maximum total size of inbound request headers.
`readHeaderTimeoutMs`	int	`10000` (10s)	Time a client may take to send the request headers before the connection is dropped (slowloris mitigation).
`readTimeoutMs`	int	`0` (disabled)	Time to read the entire request including body. Bounds slow-body attacks; affects only request reads, never response streaming. Disabled by default so large/slow uploads aren't truncated.
`tlsHandshakeTimeoutMs`	int	`10000` (10s)	Time a client may take to complete the TLS handshake before the connection is dropped (slow-handshake slowloris mitigation). Independent of `readHeaderTimeoutMs`, which only starts ticking after the handshake completes. See Request Hardening below.
`maxConcurrentTLSHandshakes`	int	`16384`	Ceiling on simultaneous in-flight TLS handshakes. Bounds handshake goroutine count under a connection flood; excess connections are dropped immediately and counted by `philter_proxy_tls_handshakes_shed_total`. Established connections are unaffected. See Request Hardening below.
`trustedProxies`	string list	empty (XFF ignored)	CIDR ranges of upstream load balancers / reverse proxies whose `X-Forwarded-For` header should be honored. Empty (default) means XFF is never trusted -- the safe behavior when the proxy is exposed directly to the internet. Operators behind a trusted LB must populate this with the LB's source CIDR(s) to restore accurate per-IP rate limits and audit-log IPs. See Trusted Proxies / X-Forwarded-For.

`logging`

Field	Type	Default	Description
`enabled`	bool	`true`	Enable structured audit logging
`file`	string	(none)	Path to an additional log output file. When set, logs are written to both stdout and this file.

`metrics`

Field	Type	Default	Description
`enabled`	bool	`true`	Enable the Prometheus metrics endpoint
`port`	int	`9090`	Port for the metrics HTTP server (separate from the proxy TLS port)

See Monitoring for available metrics, PromQL examples, and Grafana dashboard setup.

`tracing`

Field	Type	Default	Description
`enabled`	bool	`false`	Initialise the OpenTelemetry SDK. With this off the proxy pays zero per-request tracing overhead.
`serviceName`	string	`philter-ai-proxy`	The OTel `service.name` resource attribute when `OTEL_SERVICE_NAME` is not set.

OTLP exporter destination, protocol, headers, sampler, and other tuning are all configured via the standard OTel env vars (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_SAMPLER, etc.). See Monitoring -> Distributed Tracing for the full list and worked examples.

`philter`

Field	Type	Default	Description
`endpoint`	string	`https://localhost:8080`	URL of the Philter instance
`tlsVerify`	bool	`true`	Enable TLS certificate verification for the Philter connection
`caCert`	string	(none)	Path to a custom CA certificate (PEM) for the Philter connection
`retry`	object	see below	Retry settings for failed Philter calls
`circuitBreaker`	object	see below	Circuit breaker settings for the Philter connection

`philter.retry`

Field	Type	Default	Description
`maxAttempts`	int	`3`	Total number of attempts (1 = no retry). Only transient errors (network errors, HTTP 5xx) are retried.
`initialBackoffMs`	int	`100`	Initial backoff delay in milliseconds before the first retry
`maxBackoffMs`	int	`2000`	Maximum backoff delay in milliseconds (backoff is capped at this value)
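To see how the three retry fields interact, the sketch below lists the delay before each retry. It assumes the backoff doubles after every attempt; the multiplier is not stated in this reference, so treat that (and the hypothetical function name `backoff_delays_ms`) as an illustration only.

```python
def backoff_delays_ms(max_attempts=3, initial_ms=100, max_ms=2000):
    """Delays (ms) before each retry; maxAttempts counts the first try too."""
    delays = []
    delay = initial_ms
    for _ in range(max(0, max_attempts - 1)):
        delays.append(min(delay, max_ms))  # cap at maxBackoffMs
        delay *= 2  # assumed doubling, not confirmed by the reference
    return delays
```

With the defaults this yields two retries, after 100 ms and 200 ms; `maxAttempts: 1` yields no retries at all.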

`philter.circuitBreaker`

Field	Type	Default	Description
`enabled`	bool	`false`	Enable the circuit breaker for the Philter connection
`threshold`	int	`5`	Number of consecutive failures before the circuit opens
`timeoutSeconds`	int	`30`	Seconds the circuit remains open before allowing a probe request (half-open state)
`fallback`	string	`block`	Action when the circuit is open: `block` (return HTTP 503) or `passthrough` (forward the request unredacted with a warning log)
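The closed, open, and half-open states described by `threshold` and `timeoutSeconds` can be sketched as a small state machine. This is an assumption-laden illustration: the class and method names are hypothetical, and the behaviour after a probe (success closes the circuit, failure reopens it) is the conventional pattern rather than something this reference spells out.

```python
import time


class CircuitBreaker:
    def __init__(self, threshold=5, timeout_seconds=30, clock=time.monotonic):
        self.threshold = threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.failures = 0        # consecutive failures while closed
        self.opened_at = None    # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.timeout_seconds:
            return "half-open"   # a probe request may go through
        return "open"

    def allow(self):
        return self.state() != "open"

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        if self.state() == "half-open":
            self.opened_at = self.clock()  # probe failed: reopen (assumed)
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

While `allow()` returns false, the configured `fallback` decides whether the request is blocked with 503 or passed through unredacted.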

`providers`

Each of the standard providers (openai, anthropic, gemini, ollama) accepts:

Field	Type	Default	Description
`target`	string	(provider default)	Target URL for the provider
`tlsVerify`	bool	`true`	Enable TLS certificate verification for this provider
`timeouts`	object	(see Provider Timeouts)	Per-provider HTTP timeouts

Default provider targets:

openai: https://api.openai.com
anthropic: https://api.anthropic.com
gemini: https://generativelanguage.googleapis.com
ollama: http://localhost:11434

`providers.openaiCompatible`

Any number of additional OpenAI-compatible providers (Mistral, Cohere, vLLM, LM Studio, etc.) can be registered under providers.openaiCompatible. Each entry maps a short name to a target URL.

providers:
  openaiCompatible:
    mistral:
      target: https://api.mistral.ai
    cohere:
      target: https://api.cohere.com
    vllm:
      target: http://vllm.internal:8000

Clients send requests to /{name}/v1/... - the proxy strips the prefix and forwards the remainder to the configured target using the same OpenAI handler logic. For example, a request to /mistral/v1/chat/completions is forwarded to https://api.mistral.ai/v1/chat/completions. The provider label in the audit log is set to the registered name.

Each entry accepts:

Field	Type	Default	Description
`target`	string	- (required)	Base URL for this provider
`tlsVerify`	bool	`true`	Enable TLS certificate verification for this provider

Reserved names: v1, api, model, and health conflict with built-in route prefixes and will be rejected at startup.
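The prefix stripping and reserved-name check can be sketched as follows. The function names are hypothetical and the `providers` argument is a plain name-to-target mapping assumed for the example; the proxy's own routing code may differ.

```python
RESERVED_NAMES = {"v1", "api", "model", "health"}


def validate_provider_names(providers):
    """Reject names that collide with built-in route prefixes."""
    clashes = sorted(RESERVED_NAMES & set(providers))
    if clashes:
        raise ValueError(f"reserved provider name(s): {', '.join(clashes)}")


def resolve_compatible(path, providers):
    """Map /{name}/rest to (name, target + /rest), or None if no name matches."""
    parts = path.split("/", 2)  # "/mistral/v1/x" -> ["", "mistral", "v1/x"]
    if len(parts) < 3 or parts[1] not in providers:
        return None
    name = parts[1]
    return name, providers[name].rstrip("/") + "/" + parts[2]
```

For instance, `/mistral/v1/chat/completions` resolves to the `mistral` entry and is forwarded to `https://api.mistral.ai/v1/chat/completions`.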

`providers.bedrock`

Amazon Bedrock is an optional provider. It is enabled by setting providers.bedrock.region. When enabled, the proxy accepts requests matching /model/{modelId}/converse and /model/{modelId}/converse-stream and forwards them to https://bedrock-runtime.{region}.amazonaws.com using AWS Signature Version 4 authentication. ConverseStream responses (AWS binary event-stream) are forwarded to the client incrementally without buffering.

Field	Type	Default	Description
`region`	string	(none - Bedrock disabled)	AWS region for the Bedrock runtime endpoint (e.g., `us-east-1`)
`roleArn`	string	(none)	ARN of an IAM role to assume for Bedrock calls (e.g., `arn:aws:iam::123456789012:role/BedrockRole`). When set, the proxy calls `sts:AssumeRole` using the host's base credentials and signs Bedrock requests with the resulting session credentials.
`tlsVerify`	bool	`true`	Enable TLS certificate verification for the Bedrock connection

Authentication: The proxy uses the standard AWS credential chain. No AWS credentials need to be supplied by the client. The recommended deployment pattern is to attach an IAM role to the compute resource running the proxy (EC2 instance profile, ECS task role, Kubernetes service account with IRSA) and grant that role the bedrock:InvokeModel permission. Environment variable credentials (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) are also supported for development.

If the host credentials do not have Bedrock access directly (e.g., in a multi-account setup), set roleArn to an IAM role ARN that the proxy should assume. The proxy will call sts:AssumeRole with the host's base credentials and use the resulting session credentials to sign Bedrock requests. The host role must have sts:AssumeRole permission on the target role.

Minimum IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "bedrock:InvokeModel",
    "Resource": "arn:aws:bedrock:us-east-1::foundation-model/*"
  }]
}

Supported models: Any model available through the Bedrock Converse API in the configured region, including Anthropic Claude, Amazon Titan, Meta Llama, Mistral, and Cohere models.

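Streaming: requests to /model/{modelId}/converse-stream are forwarded, and the AWS binary event-stream response is relayed to the client incrementally, as described above. AWS authorizes ConverseStream under the bedrock:InvokeModelWithResponseStream action, so add that action to the IAM policy if clients stream.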

`providers.azure`

Azure OpenAI is an optional first-class provider. It is enabled by setting providers.azure.target to your resource endpoint. Azure uses deployment-based routing rather than OpenAI's model-in-body convention: the proxy routes any request whose path begins with /openai/deployments/{deployment}/, preserves the path and the api-version query parameter, and forwards it to the configured Azure endpoint. Request and response bodies are OpenAI-compatible, so inbound redaction and token-usage accounting are identical to the OpenAI provider.

!!! note "Redaction scope"
    Inbound redaction covers the text-bearing fields of all JSON endpoints the proxy understands; see the Redacted Fields table for the full per-endpoint list. Multipart/binary uploads (file uploads, audio transcriptions, image edits) are not supported: the proxy expects a JSON body and rejects multipart requests with 400 invalid_request.

providers:
  azure:
    target: https://my-resource.openai.azure.com
    apiVersion: "2024-02-01"   # optional: injected when a request omits api-version
    entraID: false             # false (default) = pass the client's api-key header through
    # tlsVerify: true

Field	Type	Default	Description
`target`	string	(none - Azure disabled)	Azure OpenAI resource endpoint, e.g. `https://my-resource.openai.azure.com`.
`apiVersion`	string	(none)	Default `api-version` injected when a request doesn't supply one. Azure requires this parameter; setting it here lets clients that omit it still work. A client-supplied `api-version` always takes precedence.
`entraID`	bool	`false`	When `true`, the proxy authenticates to Azure with an Azure AD / Entra ID bearer token instead of passing the client's `api-key` through.
`tlsVerify`	bool	`true`	Enable TLS certificate verification for the Azure connection.
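The `apiVersion` precedence rule (inject only when the client omitted `api-version`) can be illustrated with the standard library. `with_api_version` is a hypothetical helper written for this example, not a function of the proxy.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_version(path_and_query, default_version):
    """Append api-version only if the request did not supply one."""
    parts = urlsplit(path_and_query)
    query = parse_qsl(parts.query, keep_blank_values=True)
    has_version = any(key == "api-version" for key, _ in query)
    if default_version and not has_version:
        query.append(("api-version", default_version))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

A client-supplied `api-version=2024-06-01` is left untouched even when the config sets `apiVersion: "2024-02-01"`.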

Authentication — two modes:

api-key pass-through (default). The client sends its Azure api-key header; the proxy forwards it unchanged (the same way it passes through Authorization for OpenAI). No proxy-side credentials are needed.
Entra ID (entraID: true). The proxy acquires a token via the default Azure credential chain — managed identity, workload identity, or environment credentials (AZURE_CLIENT_ID / AZURE_TENANT_ID / AZURE_CLIENT_SECRET) — caches it until shortly before expiry, and sets it as the Authorization: Bearer header on outbound requests (scope https://cognitiveservices.azure.com/.default). The recommended production pattern is a workload identity / managed identity assigned the Cognitive Services OpenAI User role on the resource, so no secrets are handled by clients. A token-acquisition failure returns 502 (provider_error / azure_auth_failed).

Client example (api-key mode):

curl -k "https://localhost:8080/openai/deployments/gpt-4o/chat/completions?api-version=2024-06-01" \
  -H "api-key: $AZURE_OPENAI_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

Note: Azure encodes the model in the deployment name (URL), so the request body's model field is optional; the audit log records whatever the body supplies. When the proxy is not configured for Azure, /openai/deployments/... requests return 404 (not_found / azure_disabled).

`providers.vertex`

Vertex AI (Gemini on Google Cloud) is an optional first-class provider. It is enabled by setting providers.vertex.project (and typically providers.vertex.location). Vertex's API surface differs from the public Gemini API:

Regional endpoint. Requests go to https://{location}-aiplatform.googleapis.com (e.g. us-central1-aiplatform.googleapis.com), not generativelanguage.googleapis.com. The proxy derives this from location, or you can override it with endpoint.
Resource-style paths. /v1/projects/{project}/locations/{location}/publishers/google/models/{model}:generateContent and the streaming variant :streamGenerateContent. The proxy routes any request whose path matches this shape and preserves the path verbatim when forwarding. The {project} and {location} in the URL therefore need not equal the configured values, which is useful when the proxy's ADC identity is permitted on several projects.
OAuth2 / ADC authentication. No ?key= query parameter; the proxy acquires a Google access token via Application Default Credentials and sets it as the Authorization: Bearer header on outbound requests. The cached token is refreshed shortly before expiry.

Request and response bodies are the same Gemini schema as the public provider, so inbound redaction and outbound scanning are identical to the public Gemini provider.

providers:
  vertex:
    project: my-gcp-project
    location: us-central1
    # endpoint: https://override.example.com   # optional: override the default regional endpoint
    # tlsVerify: true

Field	Type	Default	Description
`project`	string	(none - Vertex disabled)	GCP project ID. Setting this enables the Vertex provider.
`location`	string	(none)	Region used to build the default endpoint (e.g. `us-central1`). Required unless `endpoint` is set.
`endpoint`	string	derived from `location`	Override the target URL. Useful for VPC-SC private endpoints or local-emulator testing.
`tlsVerify`	bool	`true`	Enable TLS certificate verification for the Vertex connection.
`timeouts`	object	(proxy defaults)	Per-provider HTTP timeouts.
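How `location` and `endpoint` combine into the upstream URL can be sketched like this. `vertex_endpoint` is a hypothetical helper; the URL pattern is the one given earlier in this section.

```python
def vertex_endpoint(location=None, endpoint=None):
    """Return the upstream base URL for Vertex requests."""
    if endpoint:
        return endpoint  # explicit override wins
    if not location:
        raise ValueError("providers.vertex.location is required unless endpoint is set")
    return f"https://{location}-aiplatform.googleapis.com"
```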

Authentication. The proxy uses Application Default Credentials (workload identity, service-account key, the metadata server on GCE/GKE/Cloud Run, gcloud auth application-default login, etc.). The recommended production pattern is a workload identity bound to a service account with the Vertex AI User role on the project. The OAuth2 scope used is https://www.googleapis.com/auth/cloud-platform. A token-acquisition failure returns 502 (provider_error / vertex_auth_failed).

Client example.

curl -k "https://localhost:8080/v1/projects/my-gcp-project/locations/us-central1/publishers/google/models/gemini-1.5-pro:generateContent" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"Hello"}]}]}'

The client does not send any credentials -- the proxy attaches the bearer token.

Streaming. Vertex's :streamGenerateContent endpoint returns one of two shapes depending on the request:

With ?alt=sse -- Vertex emits a true SSE stream (Content-Type: text/event-stream). The proxy detects this and passes chunks through to the client as they arrive, without buffering.
Without ?alt=sse (default) -- Vertex returns a single application/json array containing all generation chunks. The proxy treats this as a regular non-streaming response: the body is buffered, redacted (when outbound scanning is on), and forwarded in one shot. This is correct behavior given the shape Vertex returns; it just is not "streaming" end-to-end.

If you want token-by-token streaming end to end, your client must add ?alt=sse to the request URL; the proxy forwards query parameters verbatim.

Audit log. The model in a Vertex request is identified by the URL (/models/{model}), not the request body. The proxy extracts the model from the path and records it in the audit entry's model field; provider is vertex. When the proxy is not configured for Vertex, requests to /v1/projects/.../models/...:generateContent return 404 (not_found / vertex_disabled).
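Extracting the model from a Vertex path for the audit entry might look like the sketch below. The regular expression and the name `vertex_model` are assumptions made for illustration; the proxy's actual matcher is not shown in this reference.

```python
import re

# Shape: /v1/projects/{project}/locations/{location}/publishers/google/models/{model}:{method}
VERTEX_PATH = re.compile(
    r"^/v1/projects/[^/]+/locations/[^/]+/publishers/google/models/"
    r"(?P<model>[^/:]+):(?:generateContent|streamGenerateContent)$"
)


def vertex_model(path):
    """Return the model name from a Vertex path, or None if the shape differs."""
    match = VERTEX_PATH.match(path)
    return match.group("model") if match else None
```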

`routes`

Routes control which Philter redaction policy and context are applied to each request. They do not control which LLM provider handles the request - provider routing is determined automatically by the URL path (see API Reference for path-to-provider mapping).

This means a single route can apply across all providers. For example, a route matching the header x-philter-policy: hipaa will use the HIPAA policy whether the request is going to OpenAI, Anthropic, Gemini, or Ollama.

Routes are evaluated in order; the first match wins. If no route matches, the defaults are used.

Each route has a match block with one or more criteria (all specified criteria must match):

Criterion	Description
`header` + `value`	Matches when the request contains the specified header with the specified value
`path`	Matches when the request URL path equals this value
`model`	Matches when the model name in the request body equals this value

Each route specifies:

Field	Required	Description
`policy`	Yes	Philter policy name to use for redaction
`context`	No	Philter context to use (falls back to `defaults.context` if not set)
`outbound`	No	Outbound response scanning settings for this route (see below)

`defaults`

Field	Type	Default	Description
`policy`	string	`default`	Philter policy used when no route matches
`context`	string	`none`	Philter context used when no route matches (or when a matched route has no context)
`outbound`	object	(disabled)	Default outbound scanning settings (see below)
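The evaluation rules for routes and defaults (first match wins, every listed criterion must match, a missing context falls back to `defaults.context`) can be sketched as below. `resolve_route` is a hypothetical function, and it assumes request header names have already been lower-cased.

```python
def resolve_route(routes, defaults, headers, path, model):
    """Return (policy, context) for a request."""
    for route in routes:
        match = route["match"]
        if "header" in match and headers.get(match["header"].lower()) != match.get("value"):
            continue
        if "path" in match and path != match["path"]:
            continue
        if "model" in match and model != match["model"]:
            continue
        # First route whose criteria all match wins.
        return route["policy"], route.get("context", defaults["context"])
    return defaults["policy"], defaults["context"]
```

With the example configuration at the top of this page, a request carrying `x-philter-policy: hipaa` resolves to `hipaa-safe-harbor` regardless of its path or model, because that route is listed first.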

`outbound`

Outbound response scanning runs the LLM's response through Philter before it is returned to the client. It is disabled by default and must be explicitly enabled. When enabled, the same Philter policy, context, and document ID used for inbound redaction are reused, so Philter can correlate the request/response pair.

Latency note: outbound scanning buffers the full provider response before returning it, adding the round-trip latency of the Philter call. For latency-sensitive workloads, consider enabling outbound scanning only on routes where compliance requires it.

Streaming note: outbound scanning is skipped automatically when the provider returns a streaming response (text/event-stream or application/x-ndjson). The response is passed through to the client unchanged, and a warning is logged.

Field	Type	Default	Description
`enabled`	bool	`false`	Enable outbound response scanning
`action`	string	`redact`	Action when PII is detected: `redact`, `block`, or `flag`

Actions:

Action	Behaviour
`redact`	Detected PII is replaced with Philter's configured replacement token before the response is returned (default).
`block`	If any PII is detected, the response is suppressed and the client receives HTTP `403` with `{"error":{"message":"response blocked: PII detected","type":"pii_blocked"}}`.
`flag`	PII is detected and logged as a warning, but the original unmodified response is returned to the client.
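The three actions in the table can be summarized in one function. This is a sketch under stated assumptions: `apply_outbound_action` is a hypothetical name, the Philter call is represented only by its results (`redacted`, `pii_found`), and the provider's own status code is assumed to pass through when the response is not blocked.

```python
import json
import logging

BLOCK_BODY = json.dumps(
    {"error": {"message": "response blocked: PII detected", "type": "pii_blocked"}},
    separators=(",", ":"),
)


def apply_outbound_action(action, status, original, redacted, pii_found):
    """Return (status, body) to send to the client."""
    if not pii_found:
        return status, original
    if action == "redact":
        return status, redacted
    if action == "block":
        return 403, BLOCK_BODY
    if action == "flag":
        logging.warning("PII detected in provider response (flag mode)")
        return status, original
    raise ValueError(f"unknown outbound action: {action}")
```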

Example - block responses containing PII for HIPAA routes:

routes:
  - match:
      header: x-philter-policy
      value: hipaa
    policy: hipaa-safe-harbor
    context: healthcare-chatbot
    outbound:
      enabled: true
      action: block

defaults:
  policy: default
  context: none
  outbound:
    enabled: false

Rate Limiting

Rate limiting is disabled by default. When enabled, the proxy enforces per-client request rate limits using the token bucket algorithm. The client identifier is the API key (when auth is enabled) or the client IP address (when auth is disabled).

Configuration

rateLimit:
  enabled: true
  requestsPerSecond: 10.0   # per-client sustained rate
  burst: 20                 # maximum burst size above the sustained rate
  global:                   # optional: hard cap across all clients combined
    requestsPerSecond: 100.0
    burst: 200

Per-key overrides are configured on the API key entry:

auth:
  apiKeys:
    - key: standard-team-key
    - key: high-volume-service-key
      rateLimit:
        requestsPerSecond: 50.0   # this key gets a higher limit
        burst: 100

`rateLimit` reference

Field	Type	Default	Description
`enabled`	bool	`false`	Enable rate limiting. When false all other fields are ignored.
`requestsPerSecond`	float	- (required when enabled)	Sustained per-client request rate (requests per second)
`burst`	int	- (required when enabled)	Maximum number of requests a client may send in a burst above the sustained rate. Must be ≥ 1.
`global.requestsPerSecond`	float	`0` (disabled)	Global sustained rate across all clients combined. `0` disables the global backstop.
`global.burst`	int	`0` (disabled)	Global burst size. Must be set alongside `global.requestsPerSecond` to enable the global limit.
`backend`	object	`type: memory`	Where token-bucket state lives. Set `type: redis` to share state across replicas. See Shared state for multi-replica deployments.

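Per-key rate limit overrides (auth.apiKeys[].rateLimit) accept the same requestsPerSecond and burst fields and take precedence over the top-level rateLimit.requestsPerSecond and rateLimit.burst for that key. They do not change the rateLimit.global backstop, which applies to all clients combined.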
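For readers unfamiliar with the token bucket algorithm, here is a minimal in-memory sketch of how `requestsPerSecond` and `burst` interact. It treats `burst` as the bucket capacity, which is the usual formulation; the class name, the injectable clock, and the way `Retry-After` is rounded are assumptions, not the proxy's implementation.

```python
import math
import time


class TokenBucket:
    def __init__(self, requests_per_second, burst, clock=time.monotonic):
        self.rate = float(requests_per_second)
        self.burst = float(burst)
        self.tokens = float(burst)  # start full
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return (allowed, retry_after_seconds)."""
        now = self.clock()
        # Refill at the sustained rate, never beyond the burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0
        return False, math.ceil((1.0 - self.tokens) / self.rate)
```

A deployment would keep one bucket per client identifier, built from the per-key override when present and from the top-level values otherwise.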

Shared state for multi-replica deployments

By default, token-bucket state lives in process memory (backend.type: memory). This is correct for a single replica, but running N replicas behind a load balancer multiplies the effective limit by N — each replica counts only the requests it sees. To enforce one consistent limit across all replicas, point the limiter at a shared Redis backend:

rateLimit:
  enabled: true
  requestsPerSecond: 100.0
  burst: 200
  backend:
    type: redis              # default: memory
    failureMode: open        # "open" (default) or "closed" — see below
    redis:
      address: redis.internal:6379
      username: philter       # optional (Redis 6+ ACL)
      password: ${REDIS_PASSWORD}   # supports ${ENV_VAR} / file: references
      db: 0
      keyPrefix: "philter:rl:"      # optional namespace
      timeoutMs: 100                # per-call timeout
      tls:
        enabled: true
        caCert: /etc/ssl/redis-ca.pem        # optional custom CA
        cert: /etc/ssl/redis-client.pem      # optional client cert (mTLS)
        key: /etc/ssl/redis-client-key.pem
        # insecureSkipVerify: true           # development only

The Redis backend implements an atomic token bucket in a server-side Lua script (a single round-trip per decision) and uses the Redis server clock, so replicas with skewed clocks still agree. The same per-client and global buckets described above apply — they are simply stored in Redis instead of process memory.

Failure mode when Redis is unreachable (backend.failureMode):

Mode	Behaviour when the backend errors or times out
`open` (default)	Fail open — degrade to the local in-memory limiter so traffic keeps flowing, still bounded per-replica. Availability is preserved at the cost of temporarily enforcing per-replica rather than global limits.
`closed`	Fail closed — reject requests with `429` while the backend is down. Choose this when exceeding the limit is worse than dropping traffic.

The local-memory limiter is always retained and is used as the fail-open fallback, so a Redis outage never takes the proxy down.
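The two failure modes reduce to a small decision wrapper. In this sketch `shared_allow` and `local_allow` are hypothetical callables standing in for the Redis-backed and in-memory limiters; a `False` result corresponds to the proxy answering 429.

```python
def decide(shared_allow, local_allow, failure_mode="open"):
    """Return True if the request may proceed."""
    try:
        return shared_allow()
    except Exception:
        # Backend error or timeout: apply backend.failureMode.
        if failure_mode == "closed":
            return False          # fail closed: reject while Redis is down
        return local_allow()      # fail open: per-replica in-memory limit
```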

Backend health is observable via Prometheus metrics: philter_proxy_ratelimit_backend_duration_seconds (call latency, labeled by backend and ok/error result), philter_proxy_ratelimit_backend_errors_total (backend error count), and philter_proxy_ratelimit_fallback_total (decisions that fell back to local memory). See Monitoring.

`rateLimit.backend` reference

Field	Type	Default	Description
`type`	string	`memory`	`memory` (per-replica, in-process) or `redis` (shared across replicas).
`failureMode`	string	`open`	Behaviour when the redis backend is unreachable: `open` (fall back to local memory) or `closed` (reject).
`redis.address`	string	- (required for redis)	Redis endpoint, `host:port`.
`redis.username`	string	(none)	Redis ACL username (Redis 6+).
`redis.password`	string	(none)	Redis password. Accepts `${ENV_VAR}` / `file:` secret references.
`redis.db`	int	`0`	Logical database number.
`redis.keyPrefix`	string	`philter:rl:`	Namespace prefix for the proxy's keys.
`redis.timeoutMs`	int	`100`	Per-call Redis timeout in milliseconds. On timeout the failure mode applies.
`redis.tls.enabled`	bool	`false`	Connect to Redis over TLS.
`redis.tls.caCert`	string	(system roots)	PEM CA bundle for verifying the Redis server certificate.
`redis.tls.cert` / `redis.tls.key`	string	(none)	Client certificate + key for mutual TLS to Redis. Both required together.
`redis.tls.insecureSkipVerify`	bool	`false`	Skip server certificate verification (development only).

Behaviour when the limit is exceeded

When a client exceeds its limit the proxy returns HTTP 429 Too Many Requests with:

Content-Type: application/json
Retry-After: <seconds> header indicating when the client may retry
JSON body: {"error":{"message":"rate limit exceeded","type":"rate_limit_error"}}

A structured warning is logged with the client identifier:

{"time":"...","level":"WARN","msg":"Rate limit exceeded","client":"api-key-or-ip"}

Client identification

Auth state	Client ID used
Auth enabled, valid key	The API key value
Auth disabled	Client IP address (`X-Forwarded-For` is honored only when the connection comes from a `listen.trustedProxies` range)

The global backstop, when configured, is checked before the per-client limit and applies regardless of which client is making the request.
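When auth is disabled, the client IP depends on `listen.trustedProxies`. The sketch below shows one common way to resolve it: ignore `X-Forwarded-For` unless the peer is trusted, then walk the header from the right and take the first hop that is not itself a trusted proxy. The right-to-left walk is an assumption; this reference only states that XFF is honored for trusted ranges. `client_ip` is a hypothetical name.

```python
import ipaddress


def client_ip(peer_ip, xff_header, trusted_cidrs):
    """Return the IP to use as the rate-limit and audit identifier."""
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]

    def is_trusted(ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    if not xff_header or not is_trusted(peer_ip):
        return peer_ip  # default: XFF is never trusted
    for hop in reversed([h.strip() for h in xff_header.split(",")]):
        try:
            if not is_trusted(hop):
                return hop
        except ValueError:
            return peer_ip  # malformed entry: fall back to the peer address
    return peer_ip
```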

Authentication

Authentication is disabled by default. The proxy accepts requests from any client with no credentials required. This is appropriate for simple deployments where network-level controls (firewall, VPC, service mesh) are sufficient. Enable authentication for environments where multiple teams or services share a proxy instance, or where access needs to be scoped per client.

API Key Authentication

Configure a list of API keys in the auth section. Each key can optionally be bound to a specific Philter policy.

auth:
  header: x-philter-proxy-key   # optional - this is the default
  apiKeys:
    - key: secret-key-for-team-a
    - key: secret-key-for-healthcare
      policy: hipaa-safe-harbor   # this key always uses the HIPAA policy

Clients include the key in the configured request header:

curl -k https://localhost:8080/v1/chat/completions \
  -H "x-philter-proxy-key: secret-key-for-team-a" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'

Behaviour:

Scenario	Result
Valid key, no policy binding	Request proceeds; policy resolved by route matching as normal
Valid key with policy binding	Request proceeds; the key's policy overrides the matched route policy
Missing header	`401 Unauthorized` with JSON error body
Invalid key value	`401 Unauthorized` with JSON error body
No keys configured	All requests pass (auth disabled)

The proxy's auth header is always stripped before forwarding. The LLM provider never sees x-philter-proxy-key. The provider's own credentials (Authorization: Bearer ..., x-api-key, etc.) pass through unchanged.
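The behaviour table and the header-stripping rule can be sketched together. This example assumes plaintext keys and lower-cased header names; `authenticate` and `effective_policy` are hypothetical helpers, and hashed keys or secret references are out of scope here.

```python
import hmac


def authenticate(headers, api_keys, header_name="x-philter-proxy-key"):
    """Return (status, key_entry, headers_to_forward)."""
    # The proxy's own auth header never reaches the provider.
    forwarded = {k: v for k, v in headers.items() if k != header_name}
    if not api_keys:
        return 200, None, forwarded  # no keys configured: auth disabled
    supplied = headers.get(header_name)
    if supplied is None:
        return 401, None, None       # missing header
    for entry in api_keys:
        # Constant-time comparison avoids leaking key prefixes via timing.
        if hmac.compare_digest(entry["key"].encode(), supplied.encode()):
            return 200, entry, forwarded
    return 401, None, None           # invalid key


def effective_policy(key_entry, route_policy):
    """A key's bound policy overrides the policy chosen by route matching."""
    if key_entry and key_entry.get("policy"):
        return key_entry["policy"]
    return route_policy
```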

`auth` reference

Field	Type	Default	Description
`header`	string	`x-philter-proxy-key`	Request header the proxy reads the API key from
`apiKeys`	list	(none - auth disabled)	List of valid API keys

`auth.apiKeys[]` entry

Field	Type	Required	Description
`key`	string	Yes	The API key value. Accepts plaintext, a pre-hashed value (see Hashing), or a `${ENV_VAR}` / `file:` secret reference (see Loading secrets from environment variables and files).
`id`	string	No	Strongly recommended. Stable opaque identifier used as the rate-limit / concurrency / quota / cache-tenant / audit-log `key_id`. Falls back to the legacy positional `key-N` when unset, which is fragile across `apiKeys` reorders. See Per-key Stable Identifiers.
`policy`	string	No	Philter policy to enforce for all requests authenticated with this key. Overrides route and default policy.
`rateLimit`	object	No	Per-key rate-limit override. See Rate Limiting.
`maxConcurrent`	int	No	Per-key in-flight concurrency cap (0 = unlimited). Applied in addition to the global `listen.maxConcurrentRequests` cap. See Concurrency Limits.
`quota`	object	No	Per-key token-quota override (daily/monthly). See Token Quotas.
`scopes`	object	No	Per-key allow-lists for providers, models, and request paths. Empty / unset means full access (backwards compatible). See Per-key Authorization.
`adminRole`	string	No	Optional scoped admin role for this key. Currently the only value is `usage-read`, which lets this key call `GET /admin/usage` without the full admin token. Empty (default) means no admin access. See Admin Roles.

Per-key Authorization (scopes)

By default, a configured API key may call any provider, model, and request path the proxy supports. Multi-tenant deployments often want to constrain individual keys: a tenant's key should only call the providers / models / endpoints that tenant is paying for, and nothing else.

auth.apiKeys[].scopes declares per-key allow-lists. Empty or unset is full access (the existing behavior); a non-empty list on any axis is deny-by-default for that axis.

auth:
  apiKeys:
    - key: team-a-key
      scopes:
        providers: [openai, anthropic]   # team A can use OpenAI and Anthropic
        models: ["gpt-4*", "claude-3*"]  # only these model families
        paths: ["/v1/"]                  # everything under /v1/, nothing else
    - key: team-b-key
      scopes:
        providers: [bedrock]             # team B is Bedrock-only
        # models / paths empty -> any model / path on bedrock
    - key: legacy-key
      # no scopes block at all -> unrestricted (backwards compat)

Field	Type	Default	Matching
`providers`	string list	empty (allow all)	Exact match against the resolved provider name: `openai`, `anthropic`, `gemini`, `ollama`, `azure`, `bedrock`, `vertex`, or a configured `openaiCompatible[].name`. A trailing `*` on an entry makes it a prefix match.
`models`	string list	empty (allow all)	Exact match against the request's `model` field, or trailing-`*` glob (e.g. `gpt-4*`). When set, requests with no model field are denied.
`paths`	string list	empty (allow all)	Prefix match against the request path after any `openaiCompatible[]` provider prefix has been stripped.

A request must satisfy every non-empty axis (logical AND across axes; logical OR within each axis). Denied requests receive HTTP 403 with one of:

`error.type`	`error.code`	When
`forbidden`	`scope_denied_provider`	Resolved provider not in the key's `providers` allow-list.
`forbidden`	`scope_denied_model`	Request `model` not in the key's `models` allow-list (or no `model` set when the allow-list is configured).
`forbidden`	`scope_denied_path`	Request path not in any of the key's `paths` prefix entries.

The denial is mirrored in the audit log: the key_id, provider, model, error_type, and error_code fields all appear on the inbound audit entry with http_status: 403, so a denied call is fully traceable by request_id without ever exposing the raw key. See Error Responses for the full client-error contract.
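The matching rules above can be sketched in a few lines of Python. This is an illustrative model, not the proxy's implementation: the function names (`check_scopes`, `_matches`) and the order in which the axes are checked are assumptions.

```python
from typing import Optional


def _matches(entry: str, value: str) -> bool:
    """Exact match, or prefix match when the entry ends with '*'."""
    if entry.endswith("*"):
        return value.startswith(entry[:-1])
    return entry == value


def check_scopes(scopes: dict, provider: str, model: Optional[str], path: str) -> Optional[str]:
    """Return None when the request is allowed, else the error.code to report.

    Assumption: axes are evaluated provider -> model -> path; the docs do not
    state which denial wins when several axes fail at once.
    """
    providers = scopes.get("providers") or []
    models = scopes.get("models") or []
    paths = scopes.get("paths") or []

    # An empty list means "allow all" on that axis; a non-empty list is deny-by-default.
    if providers and not any(_matches(p, provider) for p in providers):
        return "scope_denied_provider"
    # With a models allow-list configured, a request without a model is denied.
    if models and (not model or not any(_matches(m, model) for m in models)):
        return "scope_denied_model"
    # Paths are plain prefix matches.
    if paths and not any(path.startswith(p) for p in paths):
        return "scope_denied_path"
    return None
```

With the `team-a-key` scopes from the example, a `gpt-4o` call to `/v1/chat/completions` on `openai` is allowed, while the same call routed to `gemini` would be denied with `scope_denied_provider`.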

Admin Roles

The GET /admin/usage endpoint is gated by either:

The full admin token (admin.token), sent in the configured admin header (default x-philter-admin-token). This is the existing all-or-nothing credential and remains unchanged.
An API key with adminRole: usage-read, sent in the regular auth header (default x-philter-proxy-key). This is a scoped read-only role for billing or reporting clients that should be able to read usage but not act as a full admin or make LLM calls outside their own scopes.

admin:
  enabled: true
  token: ${ADMIN_TOKEN}

auth:
  apiKeys:
    - key: ${BILLING_READER_KEY}
      adminRole: usage-read   # this key can read /admin/usage, nothing else admin-y

adminRole is independent of scopes: the role grants admin-API access only, while scopes restricts the proxy's normal LLM-call surface. A successful admin export logs auth_mode=admin_token or auth_mode=api_key_usage_read plus the opaque key_id for the latter, so operators can distinguish the two paths in audit trails.

API Key Hashing

API keys are hashed at load and never stored in memory as plaintext. The in-memory keyStore holds only hashes; verification uses constant-time comparison. This protects against accidental disclosure via heap dumps, debug prints, or core files.

Three input formats are accepted in the key: field:

Format	Example	When to use
Plaintext	`key: SuperSecretAPIKey123`	Quickstart. The proxy hashes the value with SHA256 at load. The plaintext is in your YAML file, so keep the file out of source control.
`sha256$<64-hex>`	`key: sha256$e3b0c44...`	Production. Pre-hash externally, put the hash in YAML. The plaintext never sits in version control or the running config.
`bcrypt$<bcrypt-hash>`	`key: bcrypt$$2a$10$N9qo8...`	For users with existing bcrypt-based key management or compliance requirements. Slower (see latency table).

Why SHA256 by default. API keys are typically high-entropy random tokens (32+ random bytes). The threat model for hashing-at-rest is "an attacker who reaches a memory dump should not be able to recover live credentials." Brute-forcing 256 bits of entropy is infeasible, so a fast hash with constant-time comparison provides adequate protection. The slow-hash family (bcrypt, argon2id) is designed for low-entropy human passwords; for random API keys it adds latency without commensurate security gain.

Generating pre-hashed values. For SHA256:

printf '%s' 'SuperSecretAPIKey123' | sha256sum | awk '{print "sha256$" $1}'
# prints: sha256$<64 hex characters>

For bcrypt (using Python; any bcrypt CLI works):

python3 -c "import bcrypt; print('bcrypt$' + bcrypt.hashpw(b'SuperSecretAPIKey123', bcrypt.gensalt(10)).decode())"
# bcrypt$$2b$10$Hkpz7C0vQp...
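As a rough illustration of the `sha256$` format and of constant-time verification, the sketch below uses only the Python standard library. The helper names (`to_sha256_entry`, `verify`) are hypothetical, and the proxy's actual verification code may differ.

```python
import hashlib
import hmac


def to_sha256_entry(plaintext: str) -> str:
    """Build the value to place in the key: field, in sha256$<64-hex> form."""
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return "sha256$" + digest


def verify(supplied_key: str, stored_entry: str) -> bool:
    """Hash the supplied key and compare it to the stored entry in constant time."""
    if not stored_entry.startswith("sha256$"):
        raise ValueError("this sketch only handles sha256$ entries")
    candidate = to_sha256_entry(supplied_key)
    # hmac.compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(candidate.encode("ascii"), stored_entry.encode("ascii"))
```

Only the hash needs to be kept after load, so the plaintext can be discarded as soon as `to_sha256_entry` returns.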

Per-request latency. Approximate cost of one auth check on a modern x86 server. The proxy iterates all configured keys and verifies the supplied key against each; total latency scales with the number of configured entries.

Format	Per-entry cost	10 entries	Notes
`sha256`	~1-2 µs	~20 µs	Recommended default. Negligible for any realistic QPS.
`bcrypt` cost=4	~1-2 ms	~10-20 ms	bcrypt minimum cost; faster than the default but still meaningful.
`bcrypt` cost=10	~60-100 ms	~600 ms-1 s	bcrypt default cost. Avoid at high QPS - this will dominate your request latency.
`bcrypt` cost=12	~250-400 ms	several seconds	bcrypt's "recommended" password cost. Not appropriate for API keys.

Recommendations:

Default (SHA256): no tuning needed.
bcrypt: pick the lowest cost your compliance requirements allow. cost=4 is appropriate for high-throughput API key use.

Per-key features (rate-limit, concurrency). The proxy assigns each auth.apiKeys[] entry an opaque stable identifier. Per-key rate-limit, per-key concurrency, the response-cache tenant prefix, the usage store, and audit-log key_id are all keyed by this identifier, so the raw API key never has to reach those subsystems. See Per-key Stable Identifiers for the explicit id: field (strongly recommended) and the legacy positional fallback (key-0, key-1, ...).

Per-key Stable Identifiers

Each auth.apiKeys[] entry can declare an explicit id: field, which is used as its stable opaque identifier wherever the proxy needs one (rate-limit bucket, concurrency bucket, quota counter, cache tenant prefix, audit log key_id, usage export row):

auth:
  apiKeys:
    - key: ${TEAM_A_KEY}
      id: team-a                 # explicit
    - key: ${TEAM_B_KEY}
      id: team-b
    - key: ${BILLING_READER_KEY}
      id: billing-reader
      adminRole: usage-read

When id: is omitted the proxy falls back to the legacy positional identifier (key-0, key-1, ... derived from the entry's position in the list). Setting id: explicitly is strongly recommended because the positional fallback is fragile: inserting a new entry at the top of the list, removing a middle entry, or even reordering for readability re-shuffles which key owns which historical state. With a response cache enabled, that re-shuffle is a real cross-tenant data leak -- the new key-0 would inherit the old key-0's cached responses.

Validation:

Each explicit id: must be unique across auth.apiKeys.
Explicit id: values must not start with the reserved prefix key- (which would collide with the legacy positional scheme).

Migrating from positional IDs: add id: to each entry, choosing a stable label (team-a, billing-reader, etc.). The transition is opt-in -- entries without id: continue to receive their positional identifier so existing rate-limit / quota / cache state is not invalidated mid-flight.
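The identifier rules can be modelled as follows. This is a sketch under stated assumptions: the function name `assign_key_ids` and the error messages are illustrative, and the positional index is taken to be the entry's index in the full `apiKeys` list.

```python
from typing import Dict, List


def assign_key_ids(api_keys: List[Dict]) -> List[str]:
    """Return the key_id for each entry, enforcing the two validation rules."""
    ids: List[str] = []
    seen = set()
    for position, entry in enumerate(api_keys):
        explicit = entry.get("id")
        if explicit:
            # Reserved prefix: would collide with the legacy positional scheme.
            if explicit.startswith("key-"):
                raise ValueError(f"id {explicit!r} uses the reserved prefix 'key-'")
            if explicit in seen:
                raise ValueError(f"duplicate id {explicit!r}")
            seen.add(explicit)
            ids.append(explicit)
        else:
            # Legacy fallback: derived from list position, so it changes on reorder.
            ids.append(f"key-{position}")
    return ids
```

Swapping two entries that both lack `id:` swaps their `key-N` identifiers, which is the state re-shuffle described above; entries with an explicit `id:` keep their identifier wherever they sit in the list.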

Loading secrets from environment variables and files

Storing API keys as plaintext in config.yaml means they end up in version control or baked into container images. To keep secrets out of the config file, the key: field accepts two reference syntaxes in addition to literal values:

Syntax	Example	Resolves to
`${ENV_VAR}`	`key: ${TEAM_A_KEY}`	The value of the `TEAM_A_KEY` environment variable
`file:<path>`	`key: file:/run/secrets/team-a`	The contents of `/run/secrets/team-a` (trailing whitespace/newline trimmed)
literal	`key: secret-key-for-team-a`	Used verbatim (backwards-compatible)

auth:
  apiKeys:
    - key: ${TEAM_A_KEY}                  # from environment variable
    - key: file:/run/secrets/healthcare   # from a mounted file
      policy: hipaa-safe-harbor
    - key: secret-key-legacy              # plaintext literal still works

References are resolved once, at config load, before validation and hashing. The resolved value then flows through the same hashing path as a literal — so a plaintext secret loaded from a file or env var is still SHA256-hashed in memory and never retained as plaintext beyond load.

This is the recommended way to integrate with external secret stores:

Kubernetes / Docker secrets — mount the secret as a file and reference it with file:/run/secrets/....
HashiCorp Vault, AWS Secrets Manager, etc. — have your init container or entrypoint export the secret into an environment variable (e.g. vault read, aws secretsmanager get-secret-value) and reference it with ${VAR}.

Error handling. If a referenced environment variable is unset or empty, or a referenced file is missing or empty, the proxy fails to start with a clear error that names the variable or path. Validation and resolution errors never echo the secret value itself — only the reference (env var name or file path), which is not sensitive.

This syntax is implemented by a generic resolver (resolveSecret) and is intended to apply to any future secret-bearing config field (such as provider auth headers), not just auth.apiKeys[].key.
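A minimal Python sketch of the resolution rules follows. It is not the proxy's `resolveSecret`; the name `resolve_secret`, the regular expression, and the assumption that a reference must span the whole value are all illustrative.

```python
import os
import re

# Assumption: a reference must be the entire value, e.g. "${TEAM_A_KEY}".
_ENV_REF = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_secret(value: str) -> str:
    """Resolve ${ENV_VAR} and file:<path> references; return literals verbatim."""
    match = _ENV_REF.match(value)
    if match:
        name = match.group(1)
        resolved = os.environ.get(name, "")
        if not resolved:
            # Name the variable, never the secret value.
            raise ValueError(f"environment variable {name} is unset or empty")
        return resolved
    if value.startswith("file:"):
        path = value[len("file:"):]
        try:
            with open(path, encoding="utf-8") as handle:
                resolved = handle.read().rstrip()  # trim trailing whitespace/newline
        except OSError:
            raise ValueError(f"cannot read secret file {path}") from None
        if not resolved:
            raise ValueError(f"secret file {path} is empty")
        return resolved
    return value  # plaintext literal, used as-is
```

The resolved string would then go through the same hashing step as a literal, so only the reference (variable name or path) ever appears in error messages.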

Rotating API keys

Because secrets are resolved at config load, rotation follows the lifecycle of the underlying env var / file plus a config reload:

Issue the new key in your secret store (Vault, Secrets Manager, Kubernetes Secret, etc.).
Add it alongside the old one as a second auth.apiKeys[] entry so both are valid during the cutover window (zero-downtime). For example, mount the new secret at /run/secrets/team-a-next and add a second entry with key: file:/run/secrets/team-a-next.
Reload the proxy so it re-resolves the references and picks up the new value:
- The proxy currently re-reads its config (and therefore re-resolves ${ENV_VAR} / file: references) on process restart. In Kubernetes, trigger a rolling restart (kubectl rollout restart deployment/philter-ai-proxy) — updated Secret/env values are picked up by the new pods with no dropped connections.
- (Planned: in-place reload on SIGHUP so a running process can re-resolve secrets without a restart. Until that ships, use a rolling restart.)
Migrate clients to the new key.
Remove the old entry and revoke the old secret in your store, then reload again.

Because the value lives in the secret store rather than the YAML, rotation does not require editing and re-committing config.yaml.

mTLS (Mutual TLS)

For service-to-service authentication in zero-trust environments, the proxy can require clients to present a valid TLS certificate signed by a configured CA. Set listen.clientCA to the path of the PEM-encoded CA certificate:

listen:
  port: 8080
  cert: cert.pem
  key: key.pem
  clientCA: /etc/ssl/client-ca.pem

When clientCA is set, the proxy configures RequireAndVerifyClientCert on its TLS listener. Any connection without a valid client certificate is rejected at the TLS handshake level, before any HTTP processing occurs.

mTLS and API key authentication are orthogonal - either or both can be enabled simultaneously. A typical defence-in-depth configuration uses mTLS to authenticate the connection and API keys to scope policy access per team.

Generating a test client certificate:

# CA key and cert (one-time setup)
openssl req -newkey rsa:4096 -keyout ca.key -x509 -days 3650 -out ca.crt -subj "/CN=My Proxy CA"

# Client key and CSR
openssl req -newkey rsa:2048 -keyout client.key -out client.csr -subj "/CN=my-service"

# Sign the client cert with the CA
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out client.crt -days 365

Set listen.clientCA: ca.crt in the proxy config, then pass --cert client.crt --key client.key to curl (or configure the equivalent in your HTTP client).

Audit Logging

Every proxy request produces a structured JSON log entry (JSONL) to stdout. All output from the proxy - audit entries, startup, shutdown, and errors - is structured JSON, making it safe to pipe directly into log aggregators.

Log Schema

Field	Type	Description
`time`	string	ISO 8601 timestamp
`request_id`	string	Unique ID for request correlation
`direction`	string	Scan direction: `inbound` (request) or `outbound` (response, when outbound scanning is enabled)
`provider`	string	LLM provider (`openai`, `anthropic`, `gemini`, `ollama`, `azure`, `bedrock`, `vertex`, or an `openaiCompatible[].name`)
`model`	string	Model name from the request body
`policy_name`	string	Philter policy used for redaction
`document_id`	string	Philter document ID (correlates with Philter's own logs)
`fields_redacted`	int	Number of text fields sent through Philter
`entity_count`	int	Total number of entities detected and redacted
`entity_types`	string[]	Distinct entity types detected (e.g., `["NER_ENTITY", "SSN"]`)
`redact_latency_ms`	int	Total time spent on Philter redaction calls (milliseconds)
`client_ip`	string	Client IP address (supports `X-Forwarded-For`)
`key_id`	string	Opaque stable identifier of the authenticated API key (the entry's explicit `id`, or the legacy positional `key-N` when `id` is unset), or empty when no key was authenticated. Never the raw key. Use this to correlate per-key authorization decisions (including scope denials) end-to-end.
`http_status`	int	HTTP status code of the upstream provider response
`prompt_tokens`	int	Prompt (input) token count reported by the provider. Omitted for streaming responses and when the provider does not return usage data.
`completion_tokens`	int	Completion (output) token count reported by the provider. Omitted under the same conditions as `prompt_tokens`.
`error_type`	string	The `error.type` value the client received. Empty on 2xx responses. See Error Responses.
`error_code`	string	The `error.code` value the client received. Empty on 2xx responses. See Error Responses.
`trace_id`	string	W3C trace ID, when OpenTelemetry tracing is enabled and the request was sampled. Use it to cross-reference audit log entries with traces in your APM. See Distributed Tracing.

Example Log Entries

When outbound scanning is disabled (default), one entry is emitted per request:

{"time":"2026-01-15T10:30:00Z","level":"INFO","msg":"request","request_id":"a1b2c3d4","direction":"inbound","provider":"openai","model":"gpt-4","policy_name":"default","document_id":"doc-789","fields_redacted":2,"entity_count":3,"entity_types":["NER_ENTITY","SSN"],"redact_latency_ms":45,"client_ip":"10.0.0.1","http_status":200,"prompt_tokens":312,"completion_tokens":87}

When outbound scanning is enabled, two entries are emitted per request - one for the inbound scan and one for the outbound scan. Both share the same request_id and document_id for correlation. Token counts appear on the inbound entry only:

{"time":"2026-01-15T10:30:00Z","level":"INFO","msg":"request","request_id":"a1b2c3d4","direction":"outbound","provider":"openai","model":"gpt-4","policy_name":"default","document_id":"doc-789","fields_redacted":1,"entity_count":1,"entity_types":["NER_ENTITY"],"redact_latency_ms":12,"client_ip":"10.0.0.1","http_status":200}
{"time":"2026-01-15T10:30:00Z","level":"INFO","msg":"request","request_id":"a1b2c3d4","direction":"inbound","provider":"openai","model":"gpt-4","policy_name":"default","document_id":"doc-789","fields_redacted":2,"entity_count":3,"entity_types":["NER_ENTITY","SSN"],"redact_latency_ms":45,"client_ip":"10.0.0.1","http_status":200,"prompt_tokens":312,"completion_tokens":87}
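When outbound scanning is on, a consumer usually wants to join the two entries for a request. The Python sketch below groups JSONL lines by `request_id` and `direction`. The function names are hypothetical, and filtering on `"msg":"request"` is an assumption based on the example entries above.

```python
import json
from collections import defaultdict


def group_by_request(lines):
    """Map request_id -> {direction: entry} for audit entries in a JSONL stream."""
    grouped = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Assumption: audit entries carry msg == "request"; startup, shutdown,
        # and error lines are skipped.
        if entry.get("msg") != "request":
            continue
        grouped[entry["request_id"]][entry["direction"]] = entry
    return dict(grouped)


def total_entities(entries_for_request):
    """Sum entity_count across the inbound and outbound entries of one request."""
    return sum(e.get("entity_count", 0) for e in entries_for_request.values())
```

For the two example entries above, `total_entities` would report 4 (3 inbound plus 1 outbound), with token counts read from the inbound entry.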

SIEM Integration

The proxy outputs one JSON object per line (JSONL) to stdout, which is the standard format for container-based log collection. Common integrations:

Fluentd / Fluent Bit: Use the tail input plugin pointed at the container's stdout, or the forward input with Docker's fluentd log driver. No parsing configuration is needed since the output is already JSON.
Promtail / Loki: Configure a docker or journal source. Use the json pipeline stage to extract fields for label-based querying.
Splunk: Use the Splunk Connect for Kubernetes or the HTTP Event Collector (HEC) with sourcetype=_json.
Elastic (Filebeat): Use the container or log input with json.keys_under_root: true and json.add_error_key: true.
AWS CloudWatch: Container stdout is captured automatically with ECS or EKS. Use CloudWatch Logs Insights to query JSON fields directly.

For file-based collection (non-containerized deployments), set logging.file in the config and point your collector at that path.

Streaming

The proxy supports streaming responses (stream: true) for the four built-in providers:

OpenAI: Server-Sent Events (SSE) with data: prefixed chunks
Anthropic: SSE with event: / data: chunks
Gemini: Chunked JSON via streamGenerateContent
Ollama: Newline-delimited JSON (streaming is the default)

Streaming requires no additional configuration. Inbound prompt redaction works identically for streaming and non-streaming requests. Response chunks are forwarded to the client in real time without buffering.

TLS Configuration

By default, TLS certificate verification is enabled for all outbound connections (both to the Philter backend and to LLM providers). This is the recommended configuration for production deployments.

Philter Backend with Self-Signed Certificate

If your Philter instance uses a self-signed certificate or a certificate from an internal CA, provide the CA certificate in the config:

philter:
  endpoint: https://philter.internal:8080
  caCert: /etc/ssl/internal-ca.pem

Disabling TLS Verification (Development Only)

To disable TLS verification for the Philter backend:

philter:
  tlsVerify: false

To disable TLS verification for a specific LLM provider:

providers:
  ollama:
    target: https://ollama.internal:11434
    tlsVerify: false

Warning: Disabling TLS verification makes connections vulnerable to man-in-the-middle attacks. Only disable verification in trusted development environments.

Provider Timeouts

Every outbound HTTP client the proxy creates (Philter, the four built-in LLM providers, every openaiCompatible entry, and Bedrock) honors a configurable set of transport-level timeouts. They protect the proxy from a hung upstream (stalled LLM, dropped TCP connection, slowloris attack) by bounding the network phases of each call without breaking streaming responses.

providers:
  openai:
    target: https://api.openai.com
    timeouts:
      connectMs: 5000          # TCP dial
      tlsHandshakeMs: 5000     # TLS handshake
      responseHeaderMs: 30000  # wait for upstream to start responding
      idleConnMs: 90000        # keep-alive idle eviction

The same timeouts: block is accepted under philter:, providers.bedrock:, and each providers.openaiCompatible.* entry.

Fields and defaults

Field	Default	What it bounds
`connectMs`	5000	TCP dial (`net.Dialer.Timeout`)
`tlsHandshakeMs`	5000	TLS handshake (`http.Transport.TLSHandshakeTimeout`)
`responseHeaderMs`	30000	Wait for response headers (`http.Transport.ResponseHeaderTimeout`). This is the timeout that catches a hung LLM that never starts responding.
`idleConnMs`	90000	Idle keep-alive eviction (`http.Transport.IdleConnTimeout`)

A value of 0 or an omitted field uses the default. All values are milliseconds.

Streaming and timeouts

The proxy deliberately does not set an overall request deadline (http.Client.Timeout). All four timeouts above are transport-phase timeouts — once the upstream has sent response headers, the body can stream for as long as the upstream keeps producing data. This means:

A hung LLM that accepts the connection but never starts streaming is killed by responseHeaderMs (default 30s).
A long-running streaming completion that takes 5 minutes to finish writing the body is not killed by any timeout, and that is the intended behavior.

If you need a hard ceiling on streaming wall-clock time you must enforce it at the client, with an ingress-level connection timeout, or by adding cancellation logic to your application.

When to tune

Lower responseHeaderMs for an in-cluster Philter. The 30s default fits LLM round-trips; a same-cluster Philter typically responds in single-digit milliseconds, and a 1-2s responseHeaderMs will surface backend issues much faster.
Higher responseHeaderMs for slow models or reasoning APIs. Some chain-of-thought / o1-style endpoints take 60+ seconds before the first token. Raise the value above the 30s default if you see spurious 502s on otherwise-healthy traffic.
Tighter connectMs for in-cluster providers. Local services should connect in milliseconds; a tighter dial timeout helps shed traffic to dead pods faster than the default 5s.

Concurrency Limits

The proxy can cap the number of requests it processes at any one time. When the cap is reached the proxy returns 503 Service Unavailable with Retry-After: 1 instead of queuing the request or running out of resources. Concurrency limits are disabled by default for backwards compatibility.

listen:
  maxConcurrentRequests: 200   # global in-flight cap; 0 (default) = unlimited

auth:
  apiKeys:
    - key: noisy-tenant
      maxConcurrent: 20        # per-key in-flight cap; applied in addition to the global cap

The global and per-key caps compose - a request must acquire both. The per-key cap protects the shared pool from a single noisy tenant; the global cap protects the proxy as a whole.

Warning: Pair concurrency caps with listen.readTimeoutMs when clients may be hostile. The proxy acquires its concurrency slot before reading the request body, so a slow-body uploader holds the slot for the duration of its upload. With listen.readTimeoutMs disabled (the default, chosen to accommodate large or slow legitimate uploads), an attacker holding a single compromised key can dribble request bodies indefinitely and occupy all of that key's maxConcurrent slots; with multiple compromised keys the attacker can hold keys × maxConcurrent slots. When you configure maxConcurrent to defend against this class of abuse, also set listen.readTimeoutMs to a value that bounds reasonable upload time (e.g. 60000 for 60s). See Request Hardening.

Behaviour when the limit is exceeded

When either cap is reached, the proxy returns:

HTTP status 503 Service Unavailable
Headers: Retry-After: 1, Content-Type: application/json
JSON body: {"error":{"message":"concurrency limit exceeded","type":"capacity"}}

The Retry-After value is fixed at 1 second because, unlike rate limits, there is no deterministic time at which a concurrency slot will free up.

A structured warning is logged with the scope (global or per_key) and the client identifier:

{"time":"...","level":"WARN","msg":"Concurrency limit exceeded","scope":"per_key","client":"noisy-tenant"}
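The way the two caps compose can be sketched with non-blocking semaphores. This Python model is illustrative only: the class name `ConcurrencyGuard`, the acquisition order (per-key first, then global), and the return values are assumptions, not the proxy's internals.

```python
import threading
from typing import Dict, Optional


class ConcurrencyGuard:
    """Global and per-key in-flight caps; a cap of 0 means unlimited."""

    def __init__(self, global_cap: int, per_key_caps: Dict[str, int]):
        self._global = threading.BoundedSemaphore(global_cap) if global_cap > 0 else None
        self._per_key = {
            key_id: threading.BoundedSemaphore(cap)
            for key_id, cap in per_key_caps.items()
            if cap > 0
        }

    def try_acquire(self, key_id: str) -> Optional[str]:
        """Return None on success, else the scope that rejected the request."""
        per_key = self._per_key.get(key_id)
        # Never block or queue: a full cap is an immediate rejection (HTTP 503).
        if per_key is not None and not per_key.acquire(blocking=False):
            return "per_key"
        if self._global is not None and not self._global.acquire(blocking=False):
            if per_key is not None:
                per_key.release()  # give back the slot already taken
            return "global"
        return None

    def release(self, key_id: str) -> None:
        """Release both slots once the request has finished."""
        if self._global is not None:
            self._global.release()
        per_key = self._per_key.get(key_id)
        if per_key is not None:
            per_key.release()
```

A caller would map a non-`None` result to the 503 response described above and log the returned scope.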

Choosing a value

A defensible starting point:

maxConcurrentRequests = 2 × (target_rps × p95_provider_response_seconds)

The 2× is headroom for tail latency and short bursts. Cross-check against:

Your LLM provider's concurrent-request quota. Set the proxy cap no higher than what your account can actually serve - otherwise you push work into the provider's queue and lose the back-pressure signal here.
File descriptors. Each in-flight request needs ~3 sockets (client + Philter + provider), so the default ulimit -n of 1024 is exhausted at roughly 330 concurrent requests. Raise it before raising the cap.
Memory. Each in-flight request holds one goroutine plus buffered request/response bodies (rough estimate: 50–200 KB per request). 1,000 concurrent ≈ 50–200 MB of proxy state.
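As a worked example of the sizing formula and the cross-checks, the Python sketch below uses the figures quoted above (2× headroom, ~3 sockets and 50 to 200 KB per in-flight request). The function names are hypothetical, and the traffic numbers in the usage note are made up for illustration.

```python
import math


def suggest_max_concurrent(target_rps: float, p95_seconds: float, headroom: float = 2.0) -> int:
    """maxConcurrentRequests = headroom x (target_rps x p95_provider_response_seconds)."""
    return math.ceil(headroom * target_rps * p95_seconds)


def resource_estimate(max_concurrent: int) -> dict:
    """Rough socket and memory needs at the cap, using the estimates in the text."""
    return {
        "sockets": max_concurrent * 3,                   # client + Philter + provider
        "memory_mb_low": max_concurrent * 50 / 1000,     # 50 KB per request
        "memory_mb_high": max_concurrent * 200 / 1000,   # 200 KB per request
    }
```

For a hypothetical 50 requests per second with a 2-second p95 provider latency, this gives a cap of 200, about 600 sockets, and roughly 10 to 40 MB of proxy state.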

See the Monitoring page for the metrics to watch and a PromQL recipe for computing utilization.

Request Hardening

The proxy is network-facing, so it bounds the size and duration of inbound client requests in addition to the concurrency (count) and provider timeout (outbound) limits. These are configured under listen and applied with secure defaults when unset:

listen:
  maxRequestBodyBytes: 10485760   # 10 MiB; larger bodies → HTTP 413
  maxHeaderBytes: 1048576         # 1 MiB
  readHeaderTimeoutMs: 10000      # 10s to send headers (slowloris mitigation)
  readTimeoutMs: 0                # 0 = disabled; whole-request (incl. body) read bound
  tlsHandshakeTimeoutMs: 10000    # 10s to complete the TLS handshake
  maxConcurrentTLSHandshakes: 16384  # ceiling on simultaneous in-flight handshakes

Protection	Field	Default	Behaviour
Body size	`maxRequestBodyBytes`	10 MiB	The body is wrapped in a hard limit; exceeding it returns `413 Payload Too Large` (`payload_too_large` / `request_body_too_large`) and the connection is closed. Raise it if you send large multimodal (base64 image) requests.
Header size	`maxHeaderBytes`	1 MiB	Caps total request header bytes (matches net/http's default).
Slowloris (headers)	`readHeaderTimeoutMs`	10s	Bounds how long a client may take to send the request headers; a client that dribbles headers to hold the connection open is dropped.
Slow body	`readTimeoutMs`	disabled	Bounds reading the whole request (headers + body). Opt-in, because a too-low value would truncate large or slow legitimate uploads. It affects only request reads.
Slowloris (handshake)	`tlsHandshakeTimeoutMs`	10s	Bounds how long a client may take to complete the TLS handshake. `readHeaderTimeoutMs` only starts ticking after the handshake completes, so a client that opens a TLS connection and then dribbles the handshake (or never finishes it) would otherwise tie up the connection indefinitely. Each accepted connection is gated by this deadline on its own goroutine, so one slow client cannot stall accepts of other clients. Once the handshake succeeds the deadline is cleared, so post-handshake reads and response streaming are unaffected.
Handshake flood	`maxConcurrentTLSHandshakes`	16384	Ceiling on the number of TLS handshakes in flight at once. `tlsHandshakeTimeoutMs` bounds the duration of each handshake but not how many run concurrently: under a TCP+ClientHello flood, every accepted connection would otherwise spawn a goroutine pinned for the full handshake timeout. When the ceiling is reached, new connections are dropped immediately (not queued) and counted by `philter_proxy_tls_handshakes_shed_total`. The slot is released the instant a handshake resolves — before the connection is handed to net/http — so this gates only the handshake phase and never throttles established connections. The default is far above any real workload; lower it only if you want a tighter bound on peak handshake memory.

Streaming is unaffected. The proxy deliberately does not set a write timeout, so streamed responses can run arbitrarily long. readTimeoutMs bounds only the inbound request, never the response. The same header limits and timeouts are applied to the metrics server; the handshake timeout applies only to the TLS-terminating listener.

Trusted Proxies / X-Forwarded-For

The proxy uses the apparent client IP for per-IP rate limiting (when authentication is disabled), for the audit log's client_ip field, and for operator-facing log lines such as the admin-endpoint access record.

By default, r.RemoteAddr -- the immediate TCP peer -- is used, and X-Forwarded-For is ignored. This is the safe behavior when the proxy is exposed directly to clients: any attacker could otherwise set XFF to a value of their choosing, evading per-IP rate limits and corrupting audit-log IPs.

When the proxy runs behind a trusted upstream (ALB, NLB, Nginx, Cloudflare, an Istio sidecar, etc.), listen.trustedProxies must list the CIDR ranges those upstreams connect from, so the proxy can recognize them and honor the XFF they set:

listen:
  trustedProxies:
    - 10.0.0.0/8         # internal LB subnet
    - 172.16.0.0/12      # peered VPC
    - 192.168.1.0/24

Behavior:

If r.RemoteAddr's IP falls inside any configured CIDR, the left-most non-empty X-Forwarded-For entry is taken as the client IP.
If the peer is not in any CIDR (or no CIDRs are configured at all), XFF is silently ignored and r.RemoteAddr is used.
Each CIDR is validated at startup; a malformed entry fails the config.
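The resolution rules above can be sketched with Python's `ipaddress` module. The function names are hypothetical and the `host:port` parsing is simplified; this models the documented behavior rather than reproducing the proxy's code.

```python
import ipaddress
from typing import List


def _peer_host(remote_addr: str) -> str:
    """Strip the port from 'ip:port' or '[ipv6]:port'."""
    if remote_addr.startswith("["):
        return remote_addr[1:remote_addr.index("]")]
    if remote_addr.count(":") == 1:
        return remote_addr.split(":")[0]
    return remote_addr


def resolve_client_ip(remote_addr: str, xff_header: str, trusted_cidrs: List[str]) -> str:
    """Honor X-Forwarded-For only when the TCP peer is inside a trusted CIDR."""
    # ip_network raises ValueError on a malformed CIDR, mirroring the startup check.
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]
    host = _peer_host(remote_addr)
    peer = ipaddress.ip_address(host)
    if any(peer in network for network in networks):
        # Left-most non-empty XFF entry is taken as the client IP.
        for part in xff_header.split(","):
            candidate = part.strip()
            if candidate:
                return candidate
    # Untrusted peer, no CIDRs configured, or empty XFF: fall back to the peer.
    return host
```

With `10.0.0.0/8` trusted, a request from `10.1.2.3` carrying `X-Forwarded-For: 203.0.113.7` resolves to `203.0.113.7`, while the same header from an untrusted peer is ignored.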

This is a behavioral change vs earlier releases, which trusted XFF unconditionally. Deployments that legitimately relied on XFF (those running behind a real LB) need to add the LB's source CIDR(s) to restore the previous behavior.

The Request Hardening limits (body size, header size, and read timeouts) apply per request and are independent of the concurrency guard: concurrency bounds how many requests run at once, while those limits bound how big and how slow any single request may be.

Token Quotas

Token quotas cap cumulative token consumption per API key over a calendar window, distinct from rate limits (which bound request frequency). Use them for hard cost ceilings and multi-tenant budgets. Quotas are disabled by default.

quota:
  enabled: true
  default:                  # applies to keys without their own quota
    dailyTokens: 1000000    # 0 = unlimited
    monthlyTokens: 20000000
  backend:
    type: memory            # or "redis" to share counters across replicas
    # redis:
    #   address: redis.internal:6379
    #   password: ${REDIS_PASSWORD}

auth:
  apiKeys:
    - key: ${TEAM_A_KEY}
      quota:                # per-key override (takes precedence over default)
        dailyTokens: 50000
        monthlyTokens: 1000000

How it works. Each request's prompt + completion tokens (the same counts in the audit log and Prometheus token metrics) accrue against the key's current UTC day and month windows. A request is checked before it is forwarded: if the key has already reached either limit, the proxy returns 429 Too Many Requests with a Retry-After header pointing at the window reset (next UTC midnight for daily, first of next UTC month for monthly — the longer window wins when both are exceeded). Windows reset automatically; there is no manual reset.

The error body uses type quota_exceeded with code daily_quota_exceeded or monthly_quota_exceeded.
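The window-reset arithmetic behind the Retry-After header can be sketched as follows. The function names are hypothetical, and rounding the delay up to whole seconds is an assumption.

```python
import math
from datetime import datetime, timedelta, timezone


def next_daily_reset(now: datetime) -> datetime:
    """Next UTC midnight after `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


def next_monthly_reset(now: datetime) -> datetime:
    """First instant of the next UTC month after `now`."""
    if now.month == 12:
        return datetime(now.year + 1, 1, 1, tzinfo=timezone.utc)
    return datetime(now.year, now.month + 1, 1, tzinfo=timezone.utc)


def retry_after_seconds(now: datetime, daily_exceeded: bool, monthly_exceeded: bool) -> int:
    """Seconds until the relevant window resets; `now` must be timezone-aware."""
    now = now.astimezone(timezone.utc)
    if monthly_exceeded:
        reset = next_monthly_reset(now)  # the longer window wins when both are exceeded
    elif daily_exceeded:
        reset = next_daily_reset(now)
    else:
        return 0
    return math.ceil((reset - now).total_seconds())
```

At 2026-01-15T10:30:00Z, an exhausted daily window gives 48600 seconds (13.5 hours), and an exhausted monthly window gives 1431000 seconds (until 2026-02-01T00:00:00Z).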

Notes.

Quotas apply only to authenticated keys (there is no key to bill otherwise).
A request that has started is never interrupted mid-flight; the next request after a window is exhausted is the one rejected. Token counts are only known after the response, so a single request may push a key slightly past its limit before the next one is blocked.
Responses served from the response cache (see below) consume no quota, because a cache hit never reaches the provider and so uses no new tokens. Only requests forwarded to the provider accrue tokens.
With backend.type: memory, counters are per-replica — use redis for a consistent quota across a multi-replica deployment. On a Redis error the check fails open (allows the request) so an infrastructure blip never hard-blocks traffic.

`quota` reference

Field	Type	Default	Description
`enabled`	bool	`false`	Enable token quotas.
`default.dailyTokens`	int	`0`	Per-key daily token cap applied to keys without their own quota. `0` = unlimited.
`default.monthlyTokens`	int	`0`	Per-key monthly token cap. `0` = unlimited.
`backend.type`	string	`memory`	`memory` (per-replica) or `redis` (shared). Also stores usage for the admin export.
`backend.redis.*`	—	—	Same Redis fields as the rate-limit backend (address, password, db, keyPrefix, timeoutMs, tls).

Per-key overrides live on auth.apiKeys[].quota.{dailyTokens,monthlyTokens}.

Response Cache

The optional response cache returns a stored response for repeated prompts, skipping both Philter and the LLM provider to cut cost and latency. It is disabled by default.

cache:
  enabled: true
  ttlSeconds: 300       # entry lifetime; default 300
  maxEntries: 1024      # in-memory cap (memory backend only); default 1024
  maxBodyBytes: 1048576 # responses larger than this are not cached; default 1 MiB
  backend:
    type: memory        # or "redis" to share the cache across replicas
    # redis:
    #   address: redis.internal:6379

Cache key. Entries are keyed on (key_id, model, sha256(request body)), where key_id is the authenticated key's stable identifier (see Per-key Stable Identifiers), not the raw key. Because the tenant identifier is part of the cache key, one tenant cannot read another tenant's cached response as long as identifiers stay stable, and a different model or any change to the request body is a different entry. When auth is disabled, all clients share an anon namespace.

What is cached. Only responses to non-streaming POST requests that return a 2xx status and are no larger than maxBodyBytes are cached. Requests with "stream": true are excluded, as are the Gemini streamGenerateContent and Bedrock converse-stream paths. Larger or streaming responses pass through uncached. Responses carry an X-Cache: HIT or X-Cache: MISS header so clients and dashboards can see cache behavior. A hit is served without calling Philter or the provider.
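
The eligibility conditions above can be read as a single predicate. The Python sketch below is illustrative only: is_cacheable and its parameters are hypothetical names, and how the proxy detects streaming is reduced to a boolean supplied by the caller.

```python
def is_cacheable(method: str, status: int, body_len: int,
                 streaming: bool, max_body_bytes: int = 1048576) -> bool:
    """Return True when a response meets all documented caching conditions."""
    return (
        method == "POST"
        and 200 <= status < 300          # 2xx responses only
        and not streaming                # streaming responses pass through
        and body_len <= max_body_bytes   # larger bodies are not cached
    )
```

For example, a 200 response to a non-streaming POST with a body of exactly 1048576 bytes is eligible under the default limit, while one byte more is not.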

Backends. memory is a per-replica LRU-ish cache bounded by maxEntries; redis shares entries across replicas (TTL enforced by Redis). A Redis read/write failure is treated as a miss and never fails the request.
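
The "failure is a miss" rule can be sketched in Python as a thin wrapper around the backend read. The function and the two stand-in backends are hypothetical; a real deployment would call a Redis client here, and which exceptions that client raises is an assumption.

```python
from typing import Callable, Optional


def cached_or_none(fetch: Callable[[str], Optional[bytes]], key: str) -> Optional[bytes]:
    """Return the cached body, or None on a miss or a backend failure."""
    try:
        return fetch(key)
    except OSError:
        # ConnectionError and TimeoutError are OSError subclasses in Python 3.
        # A failing backend is reported as a miss, never as a request failure.
        return None


def healthy_backend(key: str) -> Optional[bytes]:
    # Stand-in for a working cache backend with one stored entry.
    return {"k1": b"cached response"}.get(key)


def broken_backend(key: str) -> Optional[bytes]:
    # Stand-in for a backend that cannot be reached.
    raise ConnectionError("redis unavailable")
```

With this shape the caller handles only two outcomes, a hit or a miss, and continues to the provider on a miss regardless of why it occurred.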

`cache` reference

Field	Type	Default	Description
`enabled`	bool	`false`	Enable the response cache.
`ttlSeconds`	int	`300`	Lifetime of a cached entry.
`maxEntries`	int	`1024`	Maximum in-memory entries (memory backend only).
`maxBodyBytes`	int	`1048576`	Responses larger than this are not cached.
`backend.type`	string	`memory`	`memory` or `redis`.
`backend.redis.*`	—	—	Same Redis fields as the rate-limit backend.

Cache hit/miss counters are exported as philter_proxy_cache_hits_total / philter_proxy_cache_misses_total; see Monitoring.

Usage Export (Admin API)

When enabled, GET /admin/usage returns per-key token usage for billing and quota inspection. It is disabled by default and protected by an admin token.

admin:
  enabled: true
  token: ${PHILTER_ADMIN_TOKEN}   # required; accepts ${ENV_VAR} / file: references
  header: x-philter-admin-token   # optional; this is the default

Usage is tracked whenever admin.enabled or quota.enabled is set, using quota.backend for storage (so the export and quota enforcement read the same counters).

Request. Send the admin token in the configured header. JSON is returned by default; ?format=csv returns CSV.

curl -k https://localhost:8080/admin/usage \
  -H "x-philter-admin-token: $PHILTER_ADMIN_TOKEN"

curl -k "https://localhost:8080/admin/usage?format=csv" \
  -H "x-philter-admin-token: $PHILTER_ADMIN_TOKEN"

JSON response. Per key: the current UTC day/month windows with their token sums, and lifetime prompt/completion totals.

{
  "usage": [
    {
      "key_id": "key-0",
      "day": "2026-05-28", "day_tokens": 1500,
      "month": "2026-05", "month_tokens": 42000,
      "total_prompt_tokens": 38000, "total_completion_tokens": 12000
    }
  ]
}
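
A billing job might consume this response as in the Python sketch below. It parses the sample document shown above and sums lifetime prompt and completion tokens per key; the lifetime_totals helper is a hypothetical name, and fetching the document over HTTP is left out.

```python
import json

payload = """
{
  "usage": [
    {
      "key_id": "key-0",
      "day": "2026-05-28", "day_tokens": 1500,
      "month": "2026-05", "month_tokens": 42000,
      "total_prompt_tokens": 38000, "total_completion_tokens": 12000
    }
  ]
}
"""


def lifetime_totals(document: str) -> dict[str, int]:
    """Map each key_id to its lifetime prompt plus completion tokens."""
    rows = json.loads(document)["usage"]
    return {
        row["key_id"]: row["total_prompt_tokens"] + row["total_completion_tokens"]
        for row in rows
    }


totals = lifetime_totals(payload)
```

For the sample above, key-0 has a lifetime total of 38000 + 12000 = 50000 tokens.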

Keys are identified by their stable opaque ID (key-0, key-1, …, by position in auth.apiKeys), never the raw key value — the same identifier used in logs and per-key rate-limit/concurrency buckets.

Behaviour:

Scenario	Result
Valid admin token	`200` with JSON (or CSV) usage
Missing/invalid token	`401 Unauthorized` (constant-time comparison)
Non-GET method	`405 Method Not Allowed`
`admin.enabled: false`	`404 Not Found`
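
The constant-time comparison noted in the table matters because an ordinary string comparison can return early at the first differing character, which leaks timing information to an attacker guessing the token. The proxy's own code is not shown here; as an illustration, a Python service implementing the same check could use the standard library's hmac.compare_digest. The token_matches wrapper is a hypothetical name.

```python
import hmac


def token_matches(presented: str, expected: str) -> bool:
    """Compare two tokens without leaking where they first differ."""
    # Encoding to bytes lets compare_digest accept non-ASCII input as well.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```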

Every access is logged: a successful export emits an Admin usage exported line (with client IP, format, and key count — never the token), and a failed-auth attempt emits an Admin usage access denied line.

Hardening. The endpoint exposes per-customer billing data, so:

Use a high-entropy admin token (e.g. openssl rand -hex 32) supplied via a ${ENV_VAR} / file: reference, not a literal in the YAML.
The admin path is not subject to the request rate limiter, so token guesses are not throttled by the proxy. Rely on the strong token and keep the endpoint behind network controls (firewall/VPC/service mesh) or listen.clientCA mTLS where possible. The Admin usage access denied log lines give you a brute-force signal to alert on.

`admin` reference

Field	Type	Default	Description
`enabled`	bool	`false`	Enable the `GET /admin/usage` endpoint.
`token`	string	(required when enabled)	Admin token. Accepts `${ENV_VAR}` / `file:` secret references.
`header`	string	`x-philter-admin-token`	Header carrying the admin token.

Error Responses

Every error the proxy generates uses the same structured JSON shape and the same set of stable codes. Clients can parse these reliably to drive retry, alerting, and routing.

Response shape

{
  "error": {
    "message": "human-readable description",
    "type": "broad-category enum",
    "code": "specific-reason enum",
    "request_id": "uuid-or-X-Request-Id-from-caller"
  }
}

Content-Type: application/json is set on every error.
X-Request-Id is set on every response (success and error) with the same value as error.request_id.
An inbound X-Request-Id request header is honored when present; otherwise a UUID is generated.
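
A client can decode this envelope into a typed value before acting on it. The following Python sketch assumes nothing beyond the four documented fields; the ProxyError class and parse_proxy_error function are hypothetical names, and the message text in the sample is made up for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyError:
    message: str
    type: str
    code: str
    request_id: str


def parse_proxy_error(body: str) -> ProxyError:
    """Parse the proxy's error envelope; raises KeyError if a field is missing."""
    err = json.loads(body)["error"]
    return ProxyError(
        message=err["message"],
        type=err["type"],
        code=err["code"],
        request_id=err["request_id"],
    )


# Illustrative body only; the message wording is not taken from the proxy.
sample = json.dumps({
    "error": {
        "message": "daily token quota reached",
        "type": "quota_exceeded",
        "code": "daily_quota_exceeded",
        "request_id": "00000000-0000-0000-0000-000000000000",
    }
})
parsed = parse_proxy_error(sample)
```

Branching on parsed.type and parsed.code rather than on the message keeps client logic tied to the stable part of the contract.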

Stable enum

The (type, code) set below is part of the proxy's public API. New codes may be added in any release. Existing codes will not be removed or repurposed across minor versions.

Status	`type`	`code`	Trigger	`Retry-After`
400	`invalid_request`	`bad_json`	Request body is not valid JSON for the matched provider	-
400	`invalid_request`	`body_read`	Request body could not be read from the client connection	-
400	`invalid_request`	`path_not_canonical`	Request path contained `.` / `..` segments, redundant slashes, or a trailing slash. Real LLM clients construct canonical paths; the proxy refuses non-canonical paths up front to close a class of path-traversal-based scope bypass.	-
413	`payload_too_large`	`request_body_too_large`	Request body exceeded `listen.maxRequestBodyBytes`	-
401	`unauthorized`	`missing_api_key`	Auth enabled and no key in the configured header	-
401	`unauthorized`	`invalid_api_key`	Auth enabled and the supplied key was not recognised	-
403	`pii_blocked`	`outbound_blocked`	Outbound scanning is on with `action: block` and PII was found in the provider response	-
403	`forbidden`	`scope_denied_provider`	Resolved provider is not in the authenticated key's `auth.apiKeys[].scopes.providers` allow-list	-
403	`forbidden`	`scope_denied_model`	Request `model` is not in the key's `scopes.models` allow-list (or no `model` set when the allow-list is configured)	-
403	`forbidden`	`scope_denied_path`	Request path is not in any of the key's `scopes.paths` prefix entries	-
404	`not_found`	`bedrock_disabled`	A Bedrock path was requested but `providers.bedrock.region` is unset	-
404	`not_found`	`azure_disabled`	An Azure path (`/openai/deployments/...`) was requested but `providers.azure.target` is unset	-
404	`not_found`	`vertex_disabled`	A Vertex path (`/v1/projects/.../models/...:generateContent`) was requested but `providers.vertex.project` is unset	-
502	`provider_error`	`vertex_auth_failed`	The proxy could not acquire a Google ADC bearer token for Vertex	-
404	`not_found`	`admin_disabled`	`/admin/usage` was requested but `admin.enabled` is false	-
401	`unauthorized`	`invalid_admin_token`	`/admin/usage` requested with a missing or wrong admin token	-
405	`method_not_allowed`	`method_not_allowed`	`/admin/usage` requested with a non-GET method	-
429	`rate_limit_error`	`rate_limited`	Rate-limit token bucket exhausted for this client	seconds until refill
429	`quota_exceeded`	`daily_quota_exceeded`	Per-key daily token quota reached	seconds until next UTC midnight
429	`quota_exceeded`	`monthly_quota_exceeded`	Per-key monthly token quota reached	seconds until first of next UTC month
500	`internal_error`	`marshal_failed`	Re-serialising the redacted request body failed (should not occur in normal operation)	-
500	`internal_error`	`request_creation_failed`	`http.NewRequest` failed when building the upstream call (typically an invalid target URL)	-
500	`internal_error`	`bedrock_sign_failed`	AWS SigV4 signing failed (credentials cannot be retrieved)	-
500	`internal_error`	`usage_snapshot_failed`	`/admin/usage` could not read the usage store	-
502	`provider_error`	`unreachable`	Upstream LLM provider connection failed (DNS, dial, TLS)	-
502	`provider_error`	`azure_auth_failed`	Entra ID token acquisition failed for an Azure request (`providers.azure.entraID: true`)	-
502	`provider_error`	`response_read_failed`	Connected to the provider but failed to read the response body	-
502	`philter_error`	`request_failed`	Philter call failed (network or non-2xx response) and retries were exhausted	-
503	`capacity`	`concurrency_exceeded`	`listen.maxConcurrentRequests` or a per-key cap was hit	`1`
503	`circuit_open`	`philter_unavailable`	Philter circuit breaker is open with `fallback: block`	-

Errors forwarded from upstream LLM providers are passed through unchanged and follow the provider's own error format, not the schema above. The codes here apply only to errors the proxy itself generates.
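
One way a client might use the table is to honor Retry-After only for the codes documented as sending it. The Python sketch below is a possible client policy, not something the proxy prescribes: retry_delay is a hypothetical helper, and treating every other code as non-retryable is a simplifying assumption.

```python
from typing import Optional

# Codes from the table above whose rows list a Retry-After value.
RETRY_AFTER_CODES = {
    "rate_limited",
    "daily_quota_exceeded",
    "monthly_quota_exceeded",
    "concurrency_exceeded",
}


def retry_delay(code: str, retry_after: Optional[str]) -> Optional[float]:
    """Return seconds to wait before retrying, or None when no delay is known."""
    if code not in RETRY_AFTER_CODES or retry_after is None:
        return None
    try:
        # The table describes Retry-After as a number of seconds.
        return float(retry_after)
    except ValueError:
        return None
```

For quota errors the delay can be long (up to the next UTC midnight or month boundary), so a caller may prefer to surface the error instead of sleeping.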

Audit correlation

Every error response is mirrored in the audit log: the same request_id, error_type, and error_code fields appear on the inbound audit entry. To trace a single failed request end-to-end:

Grab the X-Request-Id header from the client's response.
Search audit logs for request_id=<that value>.
The matching entry's error_type and error_code will equal what the client saw.
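
The lookup in the steps above can be scripted. The Python sketch below assumes the audit log is written as one JSON object per line containing request_id, error_type, and error_code fields; that on-disk layout is an assumption made for the example (see Log Schema for the actual format), and find_audit_entries is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def find_audit_entries(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield audit entries whose request_id matches, skipping unparseable lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and entry.get("request_id") == request_id:
            yield entry


# Made-up sample lines standing in for an audit log file.
sample_log = [
    '{"request_id": "req-1", "error_type": "unauthorized", "error_code": "invalid_api_key"}',
    "not json",
    '{"request_id": "req-2"}',
]
matches = list(find_audit_entries(sample_log, "req-1"))
```

Passing an open file object as lines works the same way, since iterating a text file yields one line at a time.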
