Full text search¶
Full text search is the optional capability that lets reviewers find documents
by their content. When enabled, every document Arbiter ingests is also pushed
into an OpenSearch cluster, and the Search page in the sidebar plus
the GET /api/v1/search endpoint serve queries from
that index.
The feature is enabled by default, but the connection details for the OpenSearch cluster live entirely in the database; there is no environment variable to set on first startup.
The settings panel lives under Admin → Settings on the General tab, inside the Full text search card.
What full text search is used for¶
- The Search page in the sidebar runs a full-text match over every document the caller is allowed to see. Results outside the caller's group visibility return as `restricted: true` with content fields nulled, so the caller knows a hit exists without seeing what it is.
- The `GET /api/v1/search` REST endpoint serves the same query for programmatic clients.
- The Find similar documents button on the document review page runs OpenSearch's `more_like_this` query to surface documents in the same batch whose text resembles the one currently open. The button is hidden on the review page when full text search is disabled.
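As a rough illustration of the three uses above, the sketch below builds OpenSearch query bodies for a full-text match and a `more_like_this` lookup, and masks a hit the caller may not see. The function names, the `CONTENT_FIELDS` tuple, and the shape of a hit are assumptions made for this example; they are not taken from Arbiter's source. Only the `match` and `more_like_this` query structures are standard OpenSearch query DSL.

```python
from typing import Any, Dict

# Assumption: which fields count as "content fields" is not specified here.
CONTENT_FIELDS = ("filename", "originalText")


def build_match_query(text: str, size: int = 20) -> Dict[str, Any]:
    """Full-text match over the originalText field."""
    return {"size": size, "query": {"match": {"originalText": text}}}


def build_similar_query(index: str, doc_id: str, batch_id: str) -> Dict[str, Any]:
    """more_like_this seeded by one indexed document, limited to its batch."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "more_like_this": {
                            "fields": ["originalText"],
                            "like": [{"_index": index, "_id": doc_id}],
                            "min_term_freq": 1,
                            "min_doc_freq": 1,
                        }
                    }
                ],
                # Keep results inside the same batch as the open document.
                "filter": [{"term": {"batchId": batch_id}}],
            }
        }
    }


def mask_restricted(hit: Dict[str, Any], visible: bool) -> Dict[str, Any]:
    """Null the content fields of a hit the caller is not allowed to see."""
    if visible:
        return {**hit, "restricted": False}
    masked = {k: (None if k in CONTENT_FIELDS else v) for k, v in hit.items()}
    masked["restricted"] = True
    return masked
```

How visibility is decided (the `visible` flag) is left to the caller in this sketch, since the group-visibility rules are outside the scope of this page.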
What full text search is not used for: it does not drive the queue, status filtering, dual-approval evaluation, audit log, or any of the review-decision pipeline. Disabling full text search affects discovery only; ingest, redaction, review, and finalize all continue to function normally.
Enabling and configuring full text search¶
The Full text search card carries five fields:
| Field | Required | Notes |
|---|---|---|
| Enabled | yes | Master on/off switch. Enabled by default for new and legacy installs. |
| Endpoint | yes | Base URL of the OpenSearch cluster, including scheme and port (e.g. http://opensearch:9200). |
| Index name | yes | Lower-case letters, digits, -, _, +. Defaults to arbiter-documents. |
| Username | no | Optional HTTP basic-auth username for clusters that require it. |
| Password | no | Optional basic-auth password. Encrypted at rest with AES-GCM (the same scheme as data-source credentials). Leave blank to keep the previously stored value; tick Clear stored password to wipe it back to none. |
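The Index name rule in the table can be checked with a short pattern. This is a minimal sketch that encodes only the character set stated above; the function name is hypothetical, and OpenSearch applies its own additional naming restrictions that this check does not attempt to reproduce.

```python
import re

# Lower-case letters, digits, "-", "_" and "+", as listed in the table.
_INDEX_NAME_PATTERN = re.compile(r"[a-z0-9_+-]+")


def is_valid_index_name(name: str) -> bool:
    """Return True when the whole name uses only the permitted characters."""
    return bool(_INDEX_NAME_PATTERN.fullmatch(name))
```

For example, `is_valid_index_name("arbiter-documents")` passes, while a name containing upper-case letters or spaces does not.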
Click Save. Arbiter immediately tries to reach the cluster and one of four things happens:
- The index does not exist: Arbiter creates it with the canonical mapping (one `text` field on `originalText` for full-text matching, plus keyword fields on `id`, `batchId`, `filename`, and `status` for filters) and the settings are saved. A success banner confirms the new index.
- The index exists with the canonical mapping: settings are saved without modifying the index. A success banner confirms the reuse.
- The index exists but its mapping differs from canonical: settings are not saved yet. A yellow callout appears below the form with a side-by-side diff of the existing and expected mappings, and two buttons:
  - Continue with existing index: saves the settings as submitted and uses the existing index as-is. Search may still work, but documents that relied on the canonical mapping (e.g. `originalText` indexed for full-text matching) may not surface as expected. Recommended only when you know exactly why the mapping differs and want to keep it.
  - Cancel: discards the submission. The form is reset and nothing is saved.
- OpenSearch is unreachable, or returns a server error: settings are not saved and an error banner explains why. Fix the connection (or auth) and try again. The probe is intentionally a hard gate when the feature is being turned on, because saving an endpoint that isn't reachable would only produce a steady stream of failed indexing calls.
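The four outcomes above amount to a small decision table. The sketch below is one way to express it, assuming the probe reports an HTTP status for the index lookup (or `None` when no response arrived) and a flag saying whether the existing mapping matched. The enum, function names, and parameters are hypothetical and are not Arbiter's internal API.

```python
from enum import Enum
from typing import Optional


class SaveOutcome(Enum):
    CREATED_INDEX = "created"
    REUSED_INDEX = "reused"
    MAPPING_MISMATCH = "mismatch"
    PROBE_FAILED = "failed"


def classify_probe(http_status: Optional[int], mapping_matches: bool = False) -> SaveOutcome:
    """Map a probe result onto one of the four save outcomes."""
    if http_status is None:
        # No response at all: connection refused, DNS failure, timeout.
        return SaveOutcome.PROBE_FAILED
    if http_status == 404:
        # Index not found, so it would be created with the canonical mapping.
        return SaveOutcome.CREATED_INDEX
    if http_status == 200:
        if mapping_matches:
            return SaveOutcome.REUSED_INDEX
        return SaveOutcome.MAPPING_MISMATCH
    # Server errors and auth failures (for example 401) block the save.
    return SaveOutcome.PROBE_FAILED


def settings_saved_immediately(outcome: SaveOutcome) -> bool:
    """Only the create and reuse outcomes persist settings without further input."""
    return outcome in (SaveOutcome.CREATED_INDEX, SaveOutcome.REUSED_INDEX)
```

A mismatch is deliberately kept separate from a failure: it leaves the decision to the operator rather than rejecting the submission outright.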
The save event is recorded in the audit log as a
GENERAL_SETTINGS_CHANGE row carrying the previous and new values
(passwords are never logged, only a passwordChanged boolean).
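To illustrate the logging rule, the following sketch builds an audit payload that keeps previous and new values but replaces the password with a boolean. The row layout and key names other than `GENERAL_SETTINGS_CHANGE` and `passwordChanged` are assumptions for this example, and the caller is assumed to have already resolved a blank password field to the stored value.

```python
from typing import Any, Dict


def _without_password(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the settings with the password key removed."""
    return {k: v for k, v in settings.items() if k != "password"}


def build_settings_audit_row(previous: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Record old and new values; expose only whether the password changed."""
    return {
        "type": "GENERAL_SETTINGS_CHANGE",
        "previous": _without_password(previous),
        "new": _without_password(new),
        "passwordChanged": previous.get("password") != new.get("password"),
    }
```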
Choosing an index name¶
A single OpenSearch cluster can host multiple deployments by giving each
its own index. Pick a name that is descriptive enough to identify the
deployment when looking at the cluster directly — e.g. arbiter-staging,
arbiter-prod-redactions. The index is created on first save; renaming it
later creates a brand-new empty index, so the Search page returns no
results until new documents are ingested.
Mapping comparison rules¶
Arbiter compares only the canonical fields that the search code reads back
(id, batchId, filename, status, originalText). Extra fields on the
existing index — added by hand, by another tenant, by a previous Arbiter
version — are tolerated. A mapping is reported as a mismatch only when:
- A canonical field is missing entirely from the existing index.
- A canonical field exists but with a different `type` (for example, an `originalText` field that's been declared as `keyword` rather than `text` cannot serve full-text match queries).
This loose comparison keeps the mismatch dialog rare in practice while still catching the differences that actually break search.
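The comparison rules above can be sketched as a function over the `properties` object of an index mapping. The canonical field types follow the description on this page (one `text` field, four `keyword` fields); the function name and the wording of the returned messages are assumptions for illustration.

```python
from typing import Any, Dict, List

CANONICAL_TYPES = {
    "id": "keyword",
    "batchId": "keyword",
    "filename": "keyword",
    "status": "keyword",
    "originalText": "text",
}


def find_mapping_mismatches(existing_properties: Dict[str, Any]) -> List[str]:
    """List canonical fields that are missing or have the wrong type.

    Extra fields on the existing index are ignored on purpose.
    """
    problems = []
    for field, expected in CANONICAL_TYPES.items():
        actual = existing_properties.get(field)
        if actual is None:
            problems.append(f"{field}: missing")
        elif actual.get("type") != expected:
            problems.append(f"{field}: expected {expected}, found {actual.get('type')}")
    return problems
```

An empty list means the index can be reused as-is; any entry corresponds to a case that would trigger the mismatch dialog.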
Disabling full text search¶
Untick Enabled on the same form and click Save. Disabling does not require the cluster to be reachable: Arbiter writes the setting and skips the OpenSearch probe entirely, so you can turn the feature off even if the OpenSearch cluster is currently down.
When disabled:
- New ingests do not push the document text to OpenSearch. The text still lives in MongoDB exactly as it did before, and review continues to work normally.
- The Search page returns empty results (an informational note explains why). The `GET /api/v1/search` endpoint behaves the same way.
- The Find similar documents button on the review page is not rendered, so reviewers won't see it at all rather than getting a button that produces empty results.
- The OpenSearch cluster, if you still have one running, is left untouched. No documents are deleted; previously-indexed documents stay in the index so a future re-enable can use them as-is (subject to the mapping comparison above).
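The disabled behaviour described above can be pictured as a pair of guards around the indexing and search paths. This is only a sketch: the function names, the injected `push` and `run_query` callables, and the response shape with a `note` key are hypothetical.

```python
from typing import Any, Callable, Dict, List


def index_on_ingest(enabled: bool, document: Dict[str, Any],
                    push: Callable[[Dict[str, Any]], None]) -> bool:
    """Push a document to the index only when the feature is on."""
    if not enabled:
        # Nothing is sent; the text stays in the primary store only.
        return False
    push(document)
    return True


def search(enabled: bool, query: str,
           run_query: Callable[[str], List[Dict[str, Any]]]) -> Dict[str, Any]:
    """Return empty results plus an explanatory note when disabled."""
    if not enabled:
        return {"results": [], "note": "Full text search is disabled."}
    return {"results": run_query(query), "note": None}
```

Neither guard touches the cluster when the feature is off, which matches the point that existing indexed documents are left in place.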
Implications of disabling¶
- Search-by-text becomes unavailable. Reviewers can still find documents by batch, by status, by priority, by filename (the queue page filters all do this without OpenSearch), and by document ID. They cannot run free-text matches over the body of every document.
- Find similar documents goes away. This is the only feature that surfaces textually-related documents in the same batch.
- No new documents are indexed. If you re-enable later, the gap between disable and re-enable will not be back-filled; only documents ingested after re-enable show up in search results. Run a re-index against the cluster yourself if you need full coverage.
- Reports, audit log, and the review queue are unchanged. None of those depend on OpenSearch.
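For the manual re-index mentioned above, one option is the OpenSearch `_bulk` API, which accepts newline-delimited JSON with an action line followed by a source line per document. The sketch below only builds that request body. It assumes the caller has already loaded the documents with their plaintext `originalText`, and that the document `id` is used as the OpenSearch `_id`; both are assumptions, not confirmed behaviour.

```python
import json
from typing import Any, Dict, Iterable

# The canonical fields described on this page.
INDEXED_FIELDS = ("id", "batchId", "filename", "status", "originalText")


def build_bulk_body(index: str, documents: Iterable[Dict[str, Any]]) -> str:
    """Build an NDJSON body for POST /_bulk (Content-Type: application/x-ndjson)."""
    lines = []
    for doc in documents:
        # Action line: index (create or overwrite) under the document's id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        # Source line: only the canonical fields are sent.
        lines.append(json.dumps({f: doc.get(f) for f in INDEXED_FIELDS}))
    # The bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""
```

Sending the body in modest batches rather than one large request keeps individual failures easier to retry.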
Disabling is the right choice when:
- You don't have an OpenSearch cluster available and don't want indexing attempts to flap against `localhost:9200`.
- Your deployment policy disallows indexing PII-bearing document text in a second store. Note that with full text search enabled, the unredacted source text is sent to OpenSearch in plaintext (see Security considerations below). If you cannot satisfy the precautions listed there, leave the feature off.
- You're temporarily isolating the deployment from the OpenSearch cluster for maintenance and want clean signals (no warning logs about indexing failures).
Security considerations¶
PII is indexed in OpenSearch when this feature is enabled
Enabling full text search causes the complete, unredacted source text of every ingested document to be written to the configured OpenSearch cluster — including any PII that document contains (names, email addresses, SSNs, phone numbers, account numbers, medical details, …). This is the design: OpenSearch needs the raw text to power the full-text match query.
The OpenSearch index sits outside the at-rest encryption boundary
that covers MongoDB. The
SymmetricCipher field-encryption callbacks
encrypt PII inside MongoDB before it is written to disk; no equivalent
transformation runs on the OpenSearch indexing path. Documents go
over the wire to OpenSearch as plaintext and are stored as plaintext.
Treat the OpenSearch cluster, the network path to it, the basic-auth credentials configured in the form above, the disks and snapshots backing it, and any backups taken from it as part of the same security boundary as your MongoDB instance. Anything less and the search index becomes the weakest point in your PII handling.
Concretely, apply the following precautions to the OpenSearch cluster when full text search is enabled:
- Encrypt the OpenSearch data directory at rest. Use disk-level encryption (LUKS, dm-crypt, EBS encryption, Azure Disk Encryption, GCP CMEK) on every node in the cluster. Apply the same volume encryption and the same key management posture you give your MongoDB data directory.
- Encrypt OpenSearch backups and snapshots. Snapshots written to S3 or another object store carry the full document text. Encrypt the destination bucket and rotate keys on the same schedule as your other PII backups.
- Encrypt the network path. Use HTTPS for the endpoint (`https://` in the field above) so the document text isn't exposed on the wire between Arbiter and OpenSearch. An `http://` endpoint is fine only if Arbiter and OpenSearch are co-located on a trusted network segment that isn't shared with anything else, such as a single Docker bridge network on the same host.
- Require authentication. Configure OpenSearch's security plugin (or your reverse proxy) to require basic auth, and enter the username and password into this form. Arbiter encrypts the password at rest with the same AES-GCM scheme used for data-source credentials. Anonymous access to the cluster equals anonymous access to every PII document Arbiter has ever ingested.
- Scope the credential to least privilege. The user Arbiter logs in as needs only:
  - `indices:data/write/index` and `indices:admin/create` / `indices:admin/get` on the configured index name (so Arbiter can create the index on first save and write documents on every ingest).
  - `indices:data/read/search` on the same index (so the Search page can run match and `more_like_this` queries).

  Avoid cluster-wide admin or all-indices grants.
- Restrict network reachability. Bind OpenSearch to a private network interface; don't expose `9200` on a public IP. Lock the firewall down to the subnets that Arbiter runs on.
- Audit OpenSearch access independently. Arbiter writes `DOCUMENT_PII_SENT_TO_LLM` rows when text leaves for an LLM, but it does not write a per-document audit row for every OpenSearch index operation (those happen on every ingest and would flood the audit log). Rely on OpenSearch's own audit logging or your network capture tooling to attribute reads of the index.
- Apply the same retention policy as MongoDB. When a batch is finalized and a finalization policy purges the source text from MongoDB, the OpenSearch document is not automatically deleted. Either disable full text search for batches under a delete-immediately policy, or run a periodic reindex/cleanup job against OpenSearch that honors the same retention rules.
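Two of the precautions above lend themselves to a short sketch: a least-privilege role for the OpenSearch security plugin, and a cleanup query that removes one batch from the index. The role body follows the security plugin's role format (`index_permissions` with `index_patterns` and `allowed_actions`) and uses the action names listed above; the function names are hypothetical. The cleanup body is a plain `term` query intended for `POST /<index>/_delete_by_query`. Deleting requires permissions beyond the grants listed above, so this sketch assumes the cleanup job runs under a separate maintenance credential.

```python
from typing import Any, Dict


def build_arbiter_role(index_name: str) -> Dict[str, Any]:
    """Role body scoped to a single index, with no cluster-level permissions."""
    return {
        "cluster_permissions": [],
        "index_permissions": [
            {
                "index_patterns": [index_name],
                "allowed_actions": [
                    "indices:data/write/index",   # write documents on ingest
                    "indices:admin/create",       # create the index on first save
                    "indices:admin/get",          # read the index and its mapping
                    "indices:data/read/search",   # match and more_like_this queries
                ],
            }
        ],
    }


def build_batch_purge_query(batch_id: str) -> Dict[str, Any]:
    """Delete-by-query body matching every document of one batch."""
    return {"query": {"term": {"batchId": batch_id}}}
```

Because `batchId` is a keyword field in the canonical mapping, the `term` query matches the batch identifier exactly rather than analysing it as text.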
If your deployment policy does not allow PII to live in a second store even with all of the above applied, disable full text search (see Disabling full text search above). With the feature off, the document text never leaves MongoDB.
See Security · PII at rest in MongoDB for the broader at-rest story; the section there explicitly calls out OpenSearch as needing its own encryption.
Troubleshooting¶
- "Could not reach OpenSearch at … : Connection refused": the cluster is down, or the endpoint URL is wrong. Double-check the scheme and port, and try again.
- "OpenSearch returned HTTP 401": the cluster requires authentication but the credentials supplied are missing or wrong. Re-enter the username and password and re-save.
- Mapping mismatch dialog every time: something is creating the index with a non-canonical mapping (a templated index, an external indexer, or manual `curl` setup). Either delete the index out-of-band so Arbiter can recreate it, or click Continue with existing index and accept the divergence.
- Search page is empty even though documents exist: the feature was disabled at some point during ingestion. Re-enable, then either re-index against the cluster yourself or accept that only post-re-enable documents will appear.