Data sources¶
Data sources let admins point Arbiter at an external place where documents already live — an OpenSearch or Elasticsearch index, an S3 bucket, a relational database, or a local filesystem directory — instead of (or in addition to) uploading files one at a time. Once registered, a data source becomes selectable on the Add Documents page.
The page is at /admin/data-sources, also reachable from the Data Sources
link in the sidebar's Administration section. ROLE_ADMIN only.
Data sources are the input half of Arbiter's I/O. For where finalized redacted documents go out, see Destinations.
Status: All five source types — OpenSearch, Elasticsearch, S3, Relational Database, and Local Directory — are fully wired up. Clicking Ingest from … on the Add Documents page kicks off a background job that pulls documents into the redaction queue; progress is visible on the Background Jobs page.
Source types¶
The page is split into five sections, one per type. Names must be unique
within a section (case-insensitive) — so an OpenSearch source and an S3
source can both be named archive without conflict.
OpenSearch¶
| Field | Required | Notes |
|---|---|---|
| Name | yes | Display name; unique among OpenSearch sources (case-insensitive) |
| Endpoint | yes | Cluster URL, e.g. http://localhost:9200 |
| Query | yes | Query that returns the documents to import — typically the JSON body of an _search request, prefixed with the index path (e.g. contracts/_search { "query": { "match_all": {} } }). The body's size is always overridden to 100 for paging — see Paged ingest below. |
| Text field | yes | Field name within each hit's _source that holds the document text to import (e.g. body, content). |
| Filename field | no | Optional. Field within each hit's _source whose value is used as the imported document's filename. If blank, the OpenSearch _id is used as the filename. |
| Username | no | Stored as plaintext on the document row. |
| Password | no | Encrypted at rest; see Credential encryption. |
The listing table shows Name, Endpoint, Text field, and Actions. Auth and the saved query are not surfaced inline; the saved query is editable through the Edit action button on each row.
Edit a saved source¶
Each row has an Edit button that opens a popup pre-filled with the
saved values (the Name is read-only; everything else can be changed).
The Password field is left blank by default — leaving it blank keeps the
existing password, typing a new value replaces it, and the Clear the
stored password checkbox wipes it. Saving fires an
OPENSEARCH_DATASOURCE_UPDATE audit event with the new values plus a
passwordChanged boolean.
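The three password outcomes on the Edit popup can be sketched as a small decision function. This is an illustration only: `resolve_password` is a hypothetical name, and letting the checkbox win over a typed value is an assumption, not documented Arbiter behaviour.

```python
from typing import Optional, Tuple


def resolve_password(stored: Optional[str], typed: str, clear: bool) -> Tuple[Optional[str], bool]:
    """Return (new_stored_value, password_changed) for one Edit submission."""
    if clear:
        # "Clear the stored password" wipes the value (assumed to win over typed input).
        return None, stored is not None
    if typed == "":
        # A blank field keeps whatever is already stored.
        return stored, False
    # Any typed value replaces the stored password.
    return typed, True
```

The second element of the tuple plays the role of the `passwordChanged` flag recorded in the audit event.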
Test the connection¶
Two Test buttons are available:
- The Test button beside the Add button on the form sends the current form values (endpoint, query, optional username/password) to Arbiter — handy while you're tuning a new source before saving it.
- A Test action button on each saved row sends the stored values (Arbiter decrypts the saved password server-side; the cleartext is never handed back to the browser).
Either way, Arbiter connects to OpenSearch, runs the query, and shows up to the first 10 hits in a popup along with the reported total. A green Success bar tops the popup when the query worked; otherwise a red error bar shows the reason. Test does not save anything, so it's safe to use repeatedly.
Elasticsearch¶
Elasticsearch data sources behave identically to OpenSearch: same fields, same scrolling ingest, same Test / Edit affordances. They live in their own collection, so the same name can be used for one OpenSearch and one Elasticsearch source without conflict.
| Field | Required | Notes |
|---|---|---|
| Name | yes | Display name; unique among Elasticsearch sources (case-insensitive) |
| Endpoint | yes | Cluster URL, e.g. http://localhost:9200 |
| Query | yes | Same shape as OpenSearch — <index>/_search { … JSON body … }. The body's size is overridden to 100. |
| Text field | yes | Field name within each hit's _source that holds the document text. |
| Filename field | no | Optional. Field within each hit's _source whose value is used as the imported document's filename. If blank, the Elasticsearch _id is used. |
| Username | no | Stored as plaintext on the document row. |
| Password | no | Encrypted at rest; see Credential encryption. |
Paged ingest¶
OpenSearch and Elasticsearch ingest jobs use the scroll API so they can pull arbitrarily large result sets without exhausting memory:
- The first request opens a scroll context (`?scroll=1m`) with the saved query body, forcing `"size": 100` so each page is bounded: even if the saved body says `"size": 5000`, only 100 hits land per page.
- The worker walks the page, pulls each hit's text-field value, sets traceability fields (see below), and pushes the document onto the Arbiter ingest queue.
- It then asks the server for the next scroll page and keeps going until a page returns zero hits, at which point it deletes the scroll context.
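The loop described in the steps above can be sketched as follows. This is a minimal illustration, not Arbiter's worker: the function names and the three injected callables are hypothetical, and `fake_cluster` is an in-memory stand-in so the loop can run without a cluster. Only the `_scroll_id` and `hits.hits` response fields come from the search API itself.

```python
import copy
from typing import Callable, Dict, Iterator

PAGE_SIZE = 100  # per the text above, the saved "size" is always overridden


def force_page_size(body: Dict) -> Dict:
    """Return a copy of the saved query body with "size" pinned to PAGE_SIZE."""
    paged = copy.deepcopy(body)
    paged["size"] = PAGE_SIZE
    return paged


def scroll_hits(
    body: Dict,
    open_scroll: Callable[[Dict], Dict],
    next_page: Callable[[str], Dict],
    clear_scroll: Callable[[str], None],
) -> Iterator[Dict]:
    """Yield every hit one bounded page at a time, then delete the scroll context."""
    page = open_scroll(force_page_size(body))
    scroll_id = page["_scroll_id"]
    try:
        while page["hits"]["hits"]:  # a page with zero hits ends the walk
            yield from page["hits"]["hits"]
            page = next_page(scroll_id)
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        clear_scroll(scroll_id)


def fake_cluster(total: int):
    """In-memory stand-in for a cluster, used only to exercise the loop."""
    state = {"sent": 0, "cleared": False, "size": None}

    def page() -> Dict:
        start = state["sent"]
        end = min(start + state["size"], total)
        state["sent"] = end
        hits = [{"_id": str(i), "_source": {"body": f"doc {i}"}} for i in range(start, end)]
        return {"_scroll_id": "scroll-1", "hits": {"hits": hits}}

    def open_scroll(body: Dict) -> Dict:
        state["size"] = body["size"]
        return page()

    def next_page(scroll_id: str) -> Dict:
        return page()

    def clear_scroll(scroll_id: str) -> None:
        state["cleared"] = True

    return state, open_scroll, next_page, clear_scroll
```

Injecting the transport as callables keeps the paging logic testable on its own; a real worker would back them with HTTP calls to the configured endpoint.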
Each imported document records traceability back to its source: the
sourceSystem (OPENSEARCH or ELASTICSEARCH), the cluster URL, the
index the hit came from (preferred from each hit's _index, falling back
to the configured query path), the OpenSearch / Elasticsearch _id, and an
Import Timestamp of when Arbiter pulled it. Reviewers see all of this
on the Document Information popup on the Review page.
Job progress is visible in real time on the Background Jobs page (under
the Redaction sidebar group). Each job tracks status (PENDING /
RUNNING / COMPLETED / FAILED), how many documents have been
processed vs. the total, and — when individual hits fail — a list of
per-hit reasons accessible behind a Show failure details disclosure.
S3-Compatible¶
| Field | Required | Notes |
|---|---|---|
| Name | yes | Unique among S3 sources |
| Endpoint URL | no | S3 API endpoint to target. Leave blank for Amazon S3; set to e.g. http://minio:9000 for MinIO, or to the API endpoint published by Cloudflare R2 / Backblaze B2 / any other S3-compatible storage. |
| Bucket name | yes | The bucket to read from |
| Bucket key | yes | Object-key prefix under which to look (e.g. archive/2026/). Leave blank to read from the bucket root. |
| Filename glob | yes | Filter applied within the prefix, e.g. *.txt, **/*.pdf |
| Access key | no | Encrypted at rest |
| Secret key | no | Encrypted at rest |
Access key and secret key are validated as a pair — provide both or neither. Leaving them blank means the runtime will use whatever ambient AWS credentials the application process has (environment variables, instance profile, etc.). Credentials are not shown in the listing table. The Endpoint URL column shows "AWS default" in italics when the field is left blank.
The shipped Docker compose stack registers a Demo MinIO (S3-compatible)
data source pointing at the bundled minio service (http://minio:9000,
bucket arbiter-demo) so the S3 path can be exercised without an AWS
account. See Getting started for the full demo
layout.
Recommended IAM policy¶
Grant the configured access key read-only access scoped to the bucket and
key prefix above — and nothing else. Ingestion needs s3:GetObject on the
objects under the prefix, and s3:ListBucket (constrained by an
s3:prefix condition) to enumerate them. No write, delete, or bucket-admin
permissions are required.
The following policy is a starting point — replace my-bucket and
archive/2026/ with your actual bucket name and key prefix:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ArbiterListPrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket",
      "Condition": {
        "StringLike": {
          "s3:prefix": ["archive/2026/*"]
        }
      }
    },
    {
      "Sid": "ArbiterReadObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/archive/2026/*"
    }
  ]
}
```
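If you maintain several sources, the same policy can be generated rather than hand-edited. The sketch below simply templates the document shown above; `read_only_policy` is a hypothetical helper name, and the output should still be reviewed against AWS's guidance before use.

```python
import json
from typing import Dict


def read_only_policy(bucket: str, prefix: str) -> Dict:
    """Build the two-statement read-only policy for one bucket and key prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ArbiterListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Listing is limited to keys under the configured prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ArbiterReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


print(json.dumps(read_only_policy("my-bucket", "archive/2026/"), indent=2))
```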
This is a suggestion, not a substitute for AWS's own guidance. For authoritative details on writing least-privilege S3 policies — including prefix conditions, object-tag conditions, SSE-KMS key access, and the permission semantics for non-AWS S3-compatible services (MinIO, Cloudflare R2, Backblaze B2, etc.) — refer to the AWS documentation:
- Identity and access management in Amazon S3
- Bucket policies and user policies
- Actions, resources, and condition keys for Amazon S3
For non-AWS S3-compatible services, consult that vendor's documentation — the action names and ARN format are usually compatible, but the policy attachment mechanics differ.
Relational Database¶
| Field | Required | Notes |
|---|---|---|
| Name | yes | Unique among RDB sources |
| JDBC URL | yes | Standard JDBC URL, e.g. jdbc:postgresql://host:5432/dbname |
| SQL query | yes | SELECT whose first column holds the document text. An optional column named filename (case-insensitive) supplies the per-row filename. See How rows map to documents below. |
| Watermark column | no | Name of the result column that Arbiter reads from each row. After a successful run, the per-source watermark advances to the last value read. Pair with a :lastKey placeholder in the SQL above for incremental ingest (see Incremental ingest with watermarks below). Leave blank for a full-scan ingest each time. |
| Username | no | Encrypted at rest |
| Password | no | Encrypted at rest |
Username and password are validated as a pair — provide both or neither. Leaving them blank means the runtime will rely on credentials embedded in the JDBC URL or the driver's ambient authentication. Credentials are not shown in the listing table.
Read-only safeguard¶
Relational data sources are intended for reading documents, never
mutating them. Before saving, Arbiter scans the SQL for the keywords
DELETE, TRUNCATE, and DROP as whole words
(case-insensitive). If any of those appear, the data source is rejected
— nothing is saved — and the admin sees an error like "SQL query contains
disallowed keyword(s) DELETE. Data sources must use read-only queries."
The matched keywords plus the offending SQL are recorded in the audit log
under RDB_DANGEROUS_SQL_BLOCKED (see the Audit trail
section below).
The match is whole-word-only, so legitimate column names like
dropoff_count or deleted_at do not trigger the safeguard.
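A whole-word, case-insensitive scan of this kind can be sketched with a regular expression. This is an illustration under stated assumptions, not Arbiter's validator: `disallowed_keywords` is a hypothetical name, and a plain text scan like this one also flags keywords that appear inside string literals or comments.

```python
import re
from typing import List

# Keywords named in the text above; \b makes the match whole-word-only.
_DISALLOWED = re.compile(r"\b(DELETE|TRUNCATE|DROP)\b", re.IGNORECASE)


def disallowed_keywords(sql: str) -> List[str]:
    """Return the distinct disallowed keywords in sql, upper-cased, in order of first appearance."""
    found: List[str] = []
    for match in _DISALLOWED.finditer(sql):
        keyword = match.group(1).upper()
        if keyword not in found:
            found.append(keyword)
    return found
```

Because `_` and letters count as word characters, `dropoff_count` and `deleted_at` produce no word boundary after `drop` or `delete` and are not matched.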
How rows map to documents¶
Each row in the result set becomes one document. Arbiter reads two things from every row:
- Document text: the value of column 1 of the SELECT list, whatever the admin called it. This is a positional lookup: the column's name is irrelevant; only its position matters. A row whose first column is `NULL` or empty is recorded as a failure with reason "Row N has a null/empty first column (no document text)" and never becomes a document.
- Filename: the value of a column named `filename`, looked up by name (case-insensitive: `Filename`, `FILENAME`, and `filename` all match). The value is used verbatim; no extension is added, no characters are stripped. When the column is absent, or present but null/empty for that row, the worker synthesises `row-1.txt`, `row-2.txt`, … using the 1-based row index within the current run.
The conventional query shape:
```text
SELECT body, doc_path AS filename, ...
       ^^^^ ← text (column 1, name irrelevant)
             ^^^^^^^^^^^^^^^^^^^^^ ← optional, looked up by name
FROM documents WHERE imported_at IS NULL
```
A SELECT * works as long as the schema happens to put the text column
first; explicit aliasing is safer.
Filenames are the dedupe key. Arbiter records `(data-source id, filename)` for every imported row and skips subsequent rows with the same key. If you rely on the synthesised `row-N.txt` fallback and the underlying table is later reordered or rows are inserted in the middle of the existing set, the same content will be assigned a different synthesised filename on the next run and re-imported. Returning a stable per-row identifier (primary key, content hash, etc.) as `filename` is strongly preferred for any source you intend to re-ingest incrementally.
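The mapping and dedupe rules in this section can be sketched in a few lines. This is a simplified model rather than the worker's code: `map_rows` is a hypothetical name, rows are plain tuples instead of a JDBC result set, and `seen` stands in for the filenames already recorded against one data source.

```python
from typing import List, Optional, Sequence, Set, Tuple


def map_rows(
    columns: Sequence[str],
    rows: Sequence[Sequence[Optional[str]]],
    seen: Set[str],
) -> Tuple[List[Tuple[str, str]], List[str], List[str]]:
    """Return (documents, skipped_filenames, failures) for one run."""
    # Locate the optional filename column by name, ignoring case.
    name_idx = next((i for i, c in enumerate(columns) if c.lower() == "filename"), None)
    documents: List[Tuple[str, str]] = []
    skipped: List[str] = []
    failures: List[str] = []
    for n, row in enumerate(rows, start=1):  # 1-based row index within the run
        text = row[0]  # column 1 is the document text, whatever it is called
        if text is None or text == "":
            failures.append(f"Row {n} has a null/empty first column (no document text)")
            continue
        filename = row[name_idx] if name_idx is not None else None
        if not filename:
            filename = f"row-{n}.txt"  # synthesised fallback
        if filename in seen:
            skipped.append(filename)  # already imported from this source
            continue
        seen.add(filename)
        documents.append((filename, text))
    return documents, skipped, failures
```

Running it twice over a reordered table shows why the synthesised names make a poor dedupe key: the same text gets a different `row-N.txt` and is imported again.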
Incremental ingest with watermarks¶
A source that fills in the Watermark column field uses a keyset cursor instead of a full table scan. Each run only sees rows the previous run hasn't already imported, so a daily-scheduled ingest against a 10-million-row table doesn't re-read the whole 10M every night — it picks up only what's new.
The mechanism has three moving parts:
- The `:lastKey` placeholder in the SQL. Reference it in a `WHERE` predicate against the same column you name as the watermark. Arbiter substitutes the stored watermark value at execution time:
    - On the first run (or after a reset), `:lastKey` becomes the SQL literal `NULL`. Use `COALESCE(:lastKey, …)` to pick a starting floor.
    - On every subsequent run, `:lastKey` becomes a bound JDBC parameter holding the value of the watermark column from the last row of the previous run.
- The Watermark column. Names the result column whose value Arbiter reads from each row. After a successful run completes, the source's stored watermark is advanced to that column's last-seen value (which, given `ORDER BY` on the same column, is also the largest value seen) and an `RDB_WATERMARK_ADVANCE` audit row is written. Only successful runs advance the watermark: a crash mid-cursor leaves it untouched, and the next run resumes from where the previous one started.
- The `ORDER BY` on the watermark column. Without it the "last-seen value" is whatever the database happened to emit last, which is undefined for a `SELECT` without an `ORDER BY`. Arbiter does not enforce this; it's the admin's responsibility. A missing `ORDER BY` won't break ingest, but the watermark will advance erratically and miss rows.
The canonical incremental-ingest shape:
=== "PostgreSQL"

    ```sql
    SELECT body,
           doc_path AS filename,
           id
    FROM documents
    WHERE id > COALESCE(:lastKey::bigint, 0)
    ORDER BY id
    ```

    Watermark column: `id`.

=== "MySQL / MariaDB"

    ```sql
    SELECT body,
           doc_path AS filename,
           id
    FROM documents
    WHERE id > COALESCE(CAST(:lastKey AS UNSIGNED), 0)
    ORDER BY id
    ```

    Watermark column: `id`. Note the explicit `CAST`: MySQL's
    implicit coercion against a bound String is unreliable for
    range comparisons against an integer column.

=== "SQL Server"

    ```sql
    SELECT body,
           doc_path AS filename,
           id
    FROM documents
    WHERE id > COALESCE(CAST(:lastKey AS BIGINT), 0)
    ORDER BY id
    ```

    Watermark column: `id`. SQL Server refuses implicit
    String → BIGINT casts, so the explicit `CAST` is mandatory rather
    than recommended.

=== "Oracle"

    ```sql
    SELECT body,
           doc_path AS filename,
           id
    FROM documents
    WHERE id > COALESCE(TO_NUMBER(:lastKey), 0)
    ORDER BY id
    ```

    Watermark column: `id`.
The watermark column doesn't have to be an integer. A monotonic
timestamp works the same way (WHERE updated_at > COALESCE(:lastKey,
'1970-01-01')), as does a UUIDv7 PK ordered lexically. The only hard
requirements are that the column is strictly monotonic across
inserts you want to ingest and that the SQL's ORDER BY matches.
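The substitution and advance rules in this section can be tried end to end against an in-memory SQLite database. This is a sketch under stated assumptions, not Arbiter's JDBC code: `bind_last_key` and `run_once` are hypothetical names, SQLite's `CAST(… AS INTEGER)` stands in for the driver-specific casts above, and the watermark is kept as a string to mirror the String-valued placeholder.

```python
import sqlite3
from typing import List, Optional, Tuple

SQL = """
SELECT body, doc_path AS filename, id
FROM documents
WHERE id > COALESCE(CAST(:lastKey AS INTEGER), 0)
ORDER BY id
"""


def bind_last_key(sql: str, watermark: Optional[str]) -> Tuple[str, tuple]:
    """First run: inline the literal NULL. Later runs: bind the stored value."""
    if watermark is None:
        return sql.replace(":lastKey", "NULL"), ()
    return sql.replace(":lastKey", "?"), (watermark,)


def run_once(conn: sqlite3.Connection, watermark: Optional[str]) -> Tuple[List[str], Optional[str]]:
    """Run the query and return (bodies, new_watermark)."""
    text, params = bind_last_key(SQL, watermark)
    rows = conn.execute(text, params).fetchall()
    if not rows:
        return [], watermark  # nothing new, so the watermark stays put
    # The watermark column ("id") is the third result column here.
    return [r[0] for r in rows], str(rows[-1][2])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT, doc_path TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(1, "first", "a.txt"), (2, "second", "b.txt")],
)
first_run, watermark = run_once(conn, None)       # sees rows 1 and 2
conn.execute("INSERT INTO documents VALUES (3, 'third', 'c.txt')")
second_run, watermark = run_once(conn, watermark)  # sees only row 3
```

The second run reads one row instead of three, which is the whole point of the keyset cursor.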
Changing the SQL or watermark column on an active source¶
When a source has already advanced its watermark and the admin edits either the SQL or the watermark column, Arbiter refuses the save until the Reset watermark confirmation is checked on the Edit form. The reasoning is symmetric:
- If the edit leaves the meaning of the watermark unchanged, the old value is still valid and there is no reason to reset it. The only way to keep it, however, is to leave the SQL and watermark column untouched: once either field changes, the confirmation cannot be skipped.
- If the new SQL means something different — different table, different ordering, different predicate — the old watermark would either miss new rows or re-import old ones. The reset is then the right call, and the explicit confirmation is the operator's acknowledgement that they understand.
The Edit modal shows the confirmation row only when the SQL or watermark column has actually changed on a source that has an active watermark; URL and credential edits don't require it.
Manual watermark adjustments¶
Two operator overrides are available on the Data Sources page:
- Reset watermark clears the stored value, so the next run starts from scratch (already-imported documents are still skipped via the `(source-id, filename)` dedupe). Audited as `RDB_WATERMARK_RESET`.
- Set watermark manually writes a specific value to the source. Useful for skipping past a known-bad range, or seeding the cursor before the first run so older rows are ignored. Audited as `RDB_WATERMARK_SET_MANUAL`.
When NOT to use a watermark¶
If your source already filters to "new rows" some other way — e.g.
the SQL has WHERE imported_flag = 0 and your application sets
imported_flag = 1 out of band — leave Watermark column blank.
The two mechanisms are independent; mixing them adds complexity for
no benefit. The 100,000-row per-run cap still applies, so a source
that needs to drain a backlog larger than that should use one of the
two approaches (out-of-band flag or watermark) rather than relying
on the cap to throttle.
Running an ingest¶
When the user clicks Ingest from Relational Database on the Add Documents page, the worker opens a JDBC connection using the stored URL and credentials, runs the saved SQL, and enqueues one document per row according to the column mapping above. The job is subject to:
- Dedupe keyed on `(data-source id, filename)`. New rows go in; previously-seen filenames are recorded as Skipped rather than re-imported.
- Cap: at most 100,000 rows per run, enforced via `Statement.setMaxRows(…)` at the JDBC protocol level so the database doesn't stream more data than the ingest will use.
- Timeouts: 15s connection timeout, 5min per-statement query timeout. An accidentally-broad query against a huge table can't pin a worker thread indefinitely.
The job is a Background Job and
reports Processed N / Failed M / Skipped K the same way every other
ingest does. If the JDBC connection fails (host unreachable, bad
credentials, table doesn't exist) the job is marked FAILED with the
underlying SQLSTATE plus driver message so operators can diagnose without
hunting through the application log.
The shipped Docker compose stack registers a Demo PostgreSQL data
source pointing at a bundled postgres container, whose documents
table is seeded with synthetic PII-shaped rows so the RDB ingest path
can be exercised without standing up a database by hand. See
Getting started for the full demo layout.
Example SQL queries¶
The queries below cover the common shapes Arbiter operators reach for.
Every example follows the two contracts laid out in How rows map to
documents: column 1 is the document
text (positional, name irrelevant), and the optional filename
column supplies the per-row filename (named lookup, case-insensitive).
Basic — one row per documents table row¶
The simplest shape. Use this when there's a column that already holds the full document text and a separate column for the filename, and you want to ingest the whole table once (no incremental support).
```sql
SELECT body,
       doc_path AS filename
FROM documents
```
If you want re-runs to skip already-imported rows without a watermark,
either keep the SQL filter on an out-of-band flag (WHERE imported_at
IS NULL) and update the flag yourself, or rely on the
(source-id, filename) dedupe — which only works when filename
returns stable, distinct values per row.
Composing the text from multiple columns¶
When the document text lives in several columns (e.g. a CRM ticket
with separate subject, body, and notes), concatenate them in the
SELECT. The redactor then sees the full composed text per row.
=== "PostgreSQL"

    ```sql
    SELECT subject || E'\n\n' || body || E'\n\nNotes:\n' || COALESCE(notes, '')
             AS document_text,
           ticket_id::text AS filename
    FROM support_tickets
    WHERE status = 'closed'
    ```

=== "MySQL / MariaDB"

    ```sql
    SELECT CONCAT(subject, '\n\n', body, '\n\nNotes:\n', COALESCE(notes, ''))
             AS document_text,
           CAST(ticket_id AS CHAR) AS filename
    FROM support_tickets
    WHERE status = 'closed'
    ```
The first column (document_text) is whatever the concatenation
expression returns; its name and alias are decorative — the worker
reads column 1 by position.
Joining a documents table with a per-row metadata table¶
Sometimes the filename lives elsewhere — e.g. an attachment table
keyed on document_id. The join is yours to write; Arbiter only
needs the two contractual columns on the result.
```sql
SELECT d.body,
       COALESCE(a.original_filename, d.id::text) AS filename
FROM documents d
LEFT JOIN attachments a
       ON a.document_id = d.id AND a.role = 'original'
WHERE d.imported_into_arbiter = FALSE
```
The LEFT JOIN keeps documents without a matching attachment;
COALESCE falls back to the document id (cast to text) so every row
has a stable filename for the dedupe key.
Filtering by status¶
To pick up only documents in a particular workflow state, add a WHERE
clause. The validator allows arbitrary WHERE / JOIN / GROUP BY /
HAVING / ORDER BY / LIMIT — only INSERT/UPDATE/DELETE and
their kin are refused.
```sql
SELECT body,
       case_number AS filename
FROM legal_cases
WHERE status IN ('NEEDS_REVIEW', 'PENDING_REDACTION')
  AND created_at >= CURRENT_DATE - INTERVAL '90 days'
```
Incremental ingest with an integer primary key¶
The canonical keyset-paginated shape. Set Watermark column to
id; Arbiter advances it after each successful run. See
Incremental ingest with watermarks
for the cross-driver variants and the COALESCE first-run idiom.
```sql
SELECT body,
       doc_path AS filename,
       id
FROM documents
WHERE id > COALESCE(:lastKey::bigint, 0)
ORDER BY id
```
Incremental ingest by timestamp¶
When the table doesn't have a monotonic integer PK but does have a
reliable updated_at column, use that as the watermark. Set
Watermark column to updated_at. The placeholder substitutes a
String, so cast it back to the column type in the predicate.
=== "PostgreSQL"

    ```sql
    SELECT body,
           filename,
           updated_at
    FROM documents
    WHERE updated_at > COALESCE(:lastKey::timestamptz, '1970-01-01'::timestamptz)
    ORDER BY updated_at, id   -- tiebreak on id so rows with the same
                              -- updated_at advance deterministically
    ```

=== "MySQL / MariaDB"

    ```sql
    SELECT body,
           filename,
           updated_at
    FROM documents
    WHERE updated_at > COALESCE(CAST(:lastKey AS DATETIME), '1970-01-01')
    ORDER BY updated_at, id
    ```
Caveat. A timestamp watermark misses rows whose `updated_at` equals the watermark from the previous run's last row, because the predicate is strictly `>`. The tiebreaker on `id` only orders the rows correctly for the `ORDER BY`; the next run starts from `updated_at > X`, which skips any other row at exactly time `X`. For tables with multiple rows per timestamp, prefer an integer PK watermark. The alternative predicate `(updated_at, id) > (:lastKey_ts, :lastKey_id)` is not natively supported by Arbiter (only a single `:lastKey` placeholder is substituted), so this is a documented limitation.
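The missed-row case is easy to reproduce with an in-memory SQLite table. This is only a demonstration of the strict `>` predicate: the table, the `run_once` helper, and the timestamps are made up for illustration, and the watermark is bound as an ordinary parameter on every run for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT, updated_at TEXT)")
conn.execute("INSERT INTO documents VALUES (1, 'first', '2026-01-01 10:00:00')")

QUERY = """
SELECT body, updated_at
FROM documents
WHERE updated_at > COALESCE(?, '1970-01-01')
ORDER BY updated_at, id
"""


def run_once(watermark):
    """Return (bodies, new_watermark); the watermark is the last row's updated_at."""
    rows = conn.execute(QUERY, (watermark,)).fetchall()
    return [r[0] for r in rows], (rows[-1][1] if rows else watermark)


run1, watermark = run_once(None)
# A second row lands with the same timestamp after run 1 has finished.
conn.execute("INSERT INTO documents VALUES (2, 'second', '2026-01-01 10:00:00')")
run2, watermark = run_once(watermark)  # 'second' is never returned
```

With an integer `id` watermark the second row would have been picked up, because its `id` is strictly greater than the stored value.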
Using a CTE / WITH clause¶
The validator accepts WITH … SELECT as a first-class top-level
shape. CTEs are useful when the per-row logic is complex enough that
inlining it as a subquery becomes unreadable.
```sql
WITH redactable AS (
    SELECT d.id,
           d.body,
           d.filename,
           ROW_NUMBER() OVER (PARTITION BY d.tenant_id ORDER BY d.id) AS rn
    FROM documents d
    WHERE d.body IS NOT NULL
)
SELECT body, filename, id
FROM redactable
WHERE rn <= 1000   -- up to 1000 per tenant per run, oldest first
ORDER BY id
```
Note. Any mutating keyword inside the `WITH` block (e.g. PostgreSQL's `WITH d AS (DELETE … RETURNING *) SELECT …`) is refused by the read-only safeguard; the CTE-with-DELETE pattern is exactly what the safeguard's whole-word keyword scan was built to catch.
What NOT to do¶
```sql
-- Refused: leading keyword must be SELECT or WITH.
EXPLAIN SELECT body FROM documents

-- Refused: multi-statement input. Remove the trailing semicolon.
SELECT body FROM documents;

-- Refused: mutating keyword anywhere in the statement.
WITH purged AS (DELETE FROM documents RETURNING *)
SELECT body FROM purged

-- Refused: SQL Server / MySQL write-via-SELECT primitive.
SELECT body FROM documents INTO OUTFILE '/tmp/exfil.csv'

-- Refused: PostgreSQL filesystem read primitive.
SELECT pg_read_file('/etc/passwd')

-- Accepted, but you almost certainly didn't mean it: the watermark
-- column is set but the SQL doesn't reference :lastKey. The
-- watermark advances after each run but does nothing to filter
-- the cursor; every run re-reads the entire table (capped at
-- 100,000 rows) and re-pays the dedupe cost.
SELECT body, id AS filename FROM documents ORDER BY id
```
Local Directory¶
| Field | Required | Notes |
|---|---|---|
| Name | yes | Unique among local sources |
| Directory path | yes | Absolute path on the application server's filesystem |
| Filename glob | yes | e.g. *.txt (top level only), **.pdf (any depth — ** greedily spans /) |
No credentials — the directory is read with the application's process identity, so make sure the path is reachable and readable by that user.
When the user clicks Ingest from Local Directory, the worker walks the
configured path and queues every regular file whose relative path
matches the configured glob. .pdf files are queued as binary uploads;
everything else is read as UTF-8 text. The job runs as a
Background Job and reports progress
the same way the OpenSearch/Elasticsearch ingests do. If the directory is
missing, isn't a directory, or isn't readable by the application user, the
job fails up front with a clear error and no documents are queued.
Re-running an import is safe: the worker dedupes by
(directory path, relative file path), so files that have already been
imported from the same directory are recorded as Skipped and not
re-enqueued.
Trust model and host allow-list¶
Data sources are administered by users with ROLE_ADMIN. The product
intentionally trusts admins to point Arbiter at internal services —
that's what the feature is for — and the Test buttons issue HTTP / JDBC
calls to whatever URL the admin types. In a typical deployment this is
fine: admins run Arbiter and have at least as much network access as the
application process.
In a multi-tenant or zero-trust deployment where the application server has access to internal hosts that the admin role should not be able to reach (the SSRF risk), there is a defense-in-depth setting that pins acceptable hosts:
```properties
# application.properties / arbiter-webapp
arbiter.data-sources.allowed-hosts=opensearch.internal,*.search.example.com
```
The format is a comma-separated list. Each entry is either an exact
hostname or a leading-wildcard pattern (`*.foo.com`) that matches one or
more left-side labels (so `a.foo.com` and `a.b.foo.com` both match) and
also the bare apex (`foo.com`).
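The matching rule can be sketched as follows. This is an illustration, not Arbiter's matcher: the function names are hypothetical, and lower-casing both sides before comparing is an assumption the text does not state.

```python
from typing import Iterable, List


def parse_allowed_hosts(value: str) -> List[str]:
    """Split the comma-separated property value into trimmed, non-empty patterns."""
    return [p.strip() for p in value.split(",") if p.strip()]


def host_matches(host: str, pattern: str) -> bool:
    """Exact hostname, or '*.apex' matching the apex and any number of left-side labels."""
    host, pattern = host.lower(), pattern.lower()  # assumption: case-insensitive compare
    if pattern.startswith("*."):
        apex = pattern[2:]
        # The leading dot in the suffix check stops 'evilfoo.com' matching '*.foo.com'.
        return host == apex or host.endswith("." + apex)
    return host == pattern


def is_allowed(host: str, patterns: Iterable[str]) -> bool:
    return any(host_matches(host, p) for p in patterns)
```

For the property value shown above, `search.example.com` and `a.b.search.example.com` are admitted while `example.com` is not.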
When configured, every URL on every admin Test click and every saved-source ingest job is checked against the list. This covers OpenSearch / Elasticsearch data sources, Philter instance endpoints (Admin → Philter), and Ollama instance endpoints (Admin → LLM-as-a-Judge). A non-matching host is rejected with the error "Endpoint host is not permitted." The check applies even to already-saved instances: if a host is removed from the allow-list later, a Test click on an existing row is also rejected.
Private-range default-deny¶
Private-range addresses are blocked by default. Arbiter resolves the
configured hostname and rejects it if it resolves to a loopback address
(127.x.x.x), an RFC-1918 private range (10.x.x.x, 172.16–31.x.x,
192.168.x.x), or a link-local address (169.254.x.x). This applies to
both admin form submissions and Test clicks. Numeric IP addresses in the
URL are evaluated directly using InetAddress.getByName(...), so a
literal private IP (e.g. http://127.0.0.1:9200 or
http://169.254.169.254/) cannot bypass the check by skipping DNS
resolution.
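The address ranges listed above translate directly into a membership test. The sketch below covers only the IPv4 ranges the text names and checks a literal address; resolving a hostname first is left out, and `is_blocked_address` is a hypothetical name rather than Arbiter's implementation (which, per the text, relies on Java's `InetAddress`).

```python
import ipaddress

# Loopback, the three RFC-1918 ranges, and link-local, as listed in the text.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "127.0.0.0/8",
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "169.254.0.0/16",
    )
]


def is_blocked_address(address: str) -> bool:
    """Return True when a literal IP address falls in one of the blocked ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)
```

Note that `172.16.0.0/12` ends at `172.31.255.255`, so `172.32.0.1` is a public address and passes the check.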
The reason this exists by default — even before any operator has set
arbiter.data-sources.allowed-hosts — is server-side request forgery
(SSRF). The application process can usually reach private addresses
the operator does not intend to expose through admin-supplied URLs:
cloud instance-metadata endpoints (169.254.169.254 on AWS / GCP /
Azure, which can leak role credentials), MongoDB / Redis / Philter on
loopback, intranet HTTP services on RFC-1918, and other tenants on the
same VPC. The default-deny on private ranges closes that surface
without requiring any configuration.
The default-deny is not absolute. There are two ways to admit a private host:
- Add it to `arbiter.data-sources.allowed-hosts`. When the property is set, Arbiter checks the resolved host against every pattern, including private addresses. So `arbiter.data-sources.allowed-hosts=opensearch.internal,192.168.1.100` admits exactly those two hosts (one of which is private). Hostnames that resolve to a private address are admitted the same way, as long as the hostname matches a pattern. This is the supported way to reach a legitimate internal OpenSearch / Elasticsearch / Philter / Ollama from the data-source UI.
- Disable the allow-list from Admin → Security. The master switch bypasses both the pattern check and the private-range default-deny for every outbound call. This is documented as not recommended; see Master switch (Admin → Security) below and the Security settings explanation for the SSRF rationale.
Allow-list behavior summary¶
| arbiter.data-sources.allowed-hosts | Public host | Private host |
|---|---|---|
| unset / blank (default) | accepted | rejected |
| set, host matches a pattern | accepted | accepted |
| set, host doesn't match | rejected | rejected |
Use the property to restrict which public hosts are reachable in multi-tenant or zero-trust deployments, or to admit specific private hosts (an internal OpenSearch, a bastion-routed Philter) the operator genuinely needs to reach.
Master switch (Admin → Security)¶
The whole allow-list — both the property-driven host patterns and the private-range default-deny — can be turned off from Admin → Security → Enable host allow-list. The setting is stored on the global settings document, defaults to enabled, and is read at request time, so a toggle takes effect on the next outbound check without restarting.
Disabling the master switch is documented as not recommended: it re-opens the SSRF surface that this allow-list closes. See Security settings → Data-source host allow-list for the rationale and the audit-trail entry that records each toggle.
Credential encryption¶
Every credential that admins type into the Data Sources page is encrypted with
AES-GCM before being written to MongoDB. The same scheme protects Philter API
keys — see Security · Philter API keys for
the full description (key derivation, IV format, base64 layout, and the
arbiter.crypto.secret property). The plaintext is never displayed back; the
table only shows a status (Configured / Ambient / From URL /
••••••).
OpenSearch and Elasticsearch usernames are the only credential fields stored as plaintext on the document row. Treat them as you would any other identifier in the database: Mongo-level encryption-at-rest still applies, but the values are visible in raw documents.
Removing a source¶
Each row has a Remove button that deletes the source after a confirmation prompt. Removal is hard-delete; nothing is moved to a trash collection.
Audit trail¶
Every change is recorded in the audit log with the actor's email and the affected source's id and name:
| Action | Resource | When fired |
|---|---|---|
| `OPENSEARCH_DATASOURCE_CREATE` | `OpenSearchDataSource` | OpenSearch source added |
| `OPENSEARCH_DATASOURCE_UPDATE` | `OpenSearchDataSource` | OpenSearch source edited (payload includes a `passwordChanged` boolean) |
| `OPENSEARCH_DATASOURCE_DELETE` | `OpenSearchDataSource` | OpenSearch source removed |
| `ELASTICSEARCH_DATASOURCE_CREATE` | `ElasticsearchDataSource` | Elasticsearch source added |
| `ELASTICSEARCH_DATASOURCE_UPDATE` | `ElasticsearchDataSource` | Elasticsearch source edited |
| `ELASTICSEARCH_DATASOURCE_DELETE` | `ElasticsearchDataSource` | Elasticsearch source removed |
| `S3_DATASOURCE_CREATE` | `S3DataSource` | S3 source added |
| `S3_DATASOURCE_DELETE` | `S3DataSource` | S3 source removed |
| `RDB_DATASOURCE_CREATE` | `RelationalDbDataSource` | Relational database source added. Payload includes `watermarkColumn` (empty when watermarking is off). |
| `RDB_DATASOURCE_UPDATE` | `RelationalDbDataSource` | Relational database source edited. Payload includes `credentialsChanged`, `credentialsSet`, `watermarkColumn`, and a `watermarkReset` boolean (true when the edit changed the SQL or watermark column on an already-advanced source). |
| `RDB_DATASOURCE_DELETE` | `RelationalDbDataSource` | Relational database source removed |
| `RDB_WATERMARK_ADVANCE` | `RelationalDbDataSource` | Auto-advanced after a successful incremental run. Payload includes `from`, `to`, and the row count processed in this run. |
| `RDB_WATERMARK_RESET` | `RelationalDbDataSource` | Operator-initiated reset via the Reset watermark button. Payload includes the `previousKey`. |
| `RDB_WATERMARK_SET_MANUAL` | `RelationalDbDataSource` | Operator-initiated manual override of the watermark value. Payload includes `from` and `to`. |
| `RDB_DANGEROUS_SQL_BLOCKED` | `RelationalDbDataSource` | RDB source rejected for containing `DELETE`, `TRUNCATE`, or `DROP` in the SQL; `entityId` is null because nothing was saved. The payload includes the data-source name, JDBC URL, matched keywords, and the offending `sqlQuery`. |
| `RDB_DANGEROUS_JDBC_URL_BLOCKED` | `RelationalDbDataSource` | RDB source rejected by `JdbcUrlValidator` (driver-level RCE primitive, dangerous parameter, embedded credentials, host not on the allow-list, …). `entityId` is null on create; on edit it carries the existing source id. |
| `LOCAL_DATASOURCE_CREATE` | `LocalDirectoryDataSource` | Local directory source added |
| `LOCAL_DATASOURCE_DELETE` | `LocalDirectoryDataSource` | Local directory source removed |
The *_CREATE and *_UPDATE entries record the connection details
(endpoint / bucket / URL / path, query, table, glob, plus text field
and filename field for OpenSearch and Elasticsearch) along with a
boolean credentialsSet / passwordSet flag. The encrypted credential
values themselves are never logged. The Test actions do not
produce their own audit event — they don't save anything.