Exporting annotations
Arbiter can export a batch's reviewed annotations to a destination as labeled data for downstream pipelines. An export runs as a background job: it reads the batch's APPROVED documents page by page (so a large batch is never loaded into memory at once) and writes them to the destination in the chosen data format.
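The page-by-page read can be pictured as a generator that pulls one page at a time and hands documents on as they arrive. This is only a sketch of the idea, not Arbiter's implementation: `fetch_page`, the page size, and the document shape are all hypothetical names introduced for illustration.

```python
from typing import Callable, Iterator, List

Document = dict  # hypothetical shape: {"id": ..., "text": ..., "spans": [...]}


def iter_documents(
    fetch_page: Callable[[int, int], List[Document]],
    page_size: int = 100,
) -> Iterator[Document]:
    """Yield documents one page at a time so only one page is held in memory."""
    page = 0
    while True:
        batch = fetch_page(page, page_size)  # hypothetical data-access call
        if not batch:
            return
        yield from batch
        page += 1


# Stand-in data source for illustration: 250 fake documents.
_DOCS = [{"id": i, "text": f"doc {i}", "spans": []} for i in range(250)]


def fake_fetch_page(page: int, size: int) -> List[Document]:
    """Return one slice of the fake data, or an empty list past the end."""
    return _DOCS[page * size:(page + 1) * size]


exported = sum(1 for _ in iter_documents(fake_fetch_page, page_size=100))
print(exported)  # 250
```

Because the writer consumes the generator lazily, peak memory in this sketch depends on the page size rather than on the size of the batch.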
Start an export from the Batches page: click Export on a batch row, then choose a destination and a data format. Only the team lead for that batch or an administrator can export it.
What is exported
Only the human-confirmed annotations are exported: a document's APPROVED spans.
Rejected spans and spans still awaiting a second opinion (NEEDS_SECOND_OPINION)
are excluded. A reviewed document that ended up with no approved spans is still
exported (as a negative example) so the data reflects true negatives, not just
positives.
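The selection rule above amounts to a filter on span status that never drops a reviewed document. A minimal sketch, assuming each span carries a `status` field: the field name and the `REJECTED` value are assumptions, since only APPROVED and NEEDS_SECOND_OPINION are named on this page.

```python
def exportable(document: dict) -> dict:
    """Keep only APPROVED spans; the document itself is always kept."""
    approved = [s for s in document["spans"] if s.get("status") == "APPROVED"]
    return {"text": document["text"], "spans": approved}


reviewed = {
    "text": "Contact John Smith today.",
    "spans": [
        {"start": 8, "end": 18, "label": "name", "status": "APPROVED"},
        {"start": 19, "end": 24, "label": "date", "status": "REJECTED"},  # assumed status value
        {"start": 0, "end": 7, "label": "org", "status": "NEEDS_SECOND_OPINION"},
    ],
}
negative = {
    "text": "No PII here.",
    "spans": [{"start": 3, "end": 6, "label": "name", "status": "REJECTED"}],
}

print(len(exportable(reviewed)["spans"]))  # 1
print(exportable(negative)["spans"])       # []
```

The second document still produces a record, with an empty span list, which is what makes it usable as a negative example.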
Data formats
- JSONL: one JSON object per document in a single file, with an entities array and document metadata.
- BIO: token-per-line NER format, one .bio file per document.
- PhEye: the leaner training shape used by the PhEye PII models (described below), one JSON object per document in a single file.
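To make the token-per-line BIO shape concrete, here is a rough sketch that turns character spans into B-/I-/O tags. The tokenization (whitespace splitting) and the `token tag` line layout are assumptions made for illustration; this page does not specify how Arbiter tokenizes text or lays out its .bio files.

```python
import re


def to_bio(text: str, spans: list) -> list:
    """Tag whitespace-separated tokens with B-/I-/O labels from character spans."""
    lines = []
    started = set()  # indexes of spans that already received their B- token
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for i, span in enumerate(spans):
            # A token belongs to a span when it overlaps the [start, end) range.
            if match.start() < span["end"] and match.end() > span["start"]:
                prefix = "I" if i in started else "B"
                started.add(i)
                tag = f"{prefix}-{span['label']}"
                break
        lines.append(f"{match.group()} {tag}")
    return lines


for line in to_bio(
    "Contact John Smith today.",
    [{"start": 8, "end": 18, "label": "name"}],
):
    print(line)
# Contact O
# John B-name
# Smith I-name
# today. O
```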
PhEye training format
The PhEye format produces JSONL with the same field names and types as the
PhEye model-training corpora (for example ph-eye-model-training's
passages.jsonl), so an export is drop-in usable for training and evaluating the
PhEye PII models.
Each line is one independent JSON object (UTF-8, no wrapping array, newlines inside
text escaped as \n):
{"text": "Contact John Smith today.", "spans": [{"start": 8, "end": 18, "label": "name"}]}
{"text": "UNITED STATES SECURITIES AND EXCHANGE COMMISSION", "spans": []}
Per document:
- text is the document's original text.
- spans is an array with one entry per approved span:
  - start is the inclusive character offset into text.
  - end is the exclusive character offset into text, so text.substring(start, end) is the labeled span.
  - label is the span's PII type, taken verbatim from the reviewed annotation (no remapping is applied).
A document with no approved spans is emitted with "spans": [].
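A consumer can sanity-check an export by parsing each line and slicing the text with the half-open offsets. The sketch below is illustrative only and uses the two example lines above as input. It assumes offsets count characters the way Python indexes `str` (code points); this page does not state the unit, and `substring` in Java or JavaScript counts UTF-16 code units, so text containing characters outside the Basic Multilingual Plane might need adjusting.

```python
import json

sample = (
    '{"text": "Contact John Smith today.", "spans": [{"start": 8, "end": 18, "label": "name"}]}\n'
    '{"text": "UNITED STATES SECURITIES AND EXCHANGE COMMISSION", "spans": []}\n'
)


def labeled_spans(jsonl: str) -> list:
    """Return one list of (label, substring) pairs per line, checking offsets."""
    out = []
    # Split on "\n" only: newlines inside text are escaped, so each
    # physical line is exactly one JSON object.
    for line in jsonl.split("\n"):
        if not line.strip():
            continue
        record = json.loads(line)
        text = record["text"]
        pairs = []
        for span in record["spans"]:
            start, end = span["start"], span["end"]
            if not 0 <= start < end <= len(text):
                raise ValueError(f"span out of range: {span}")
            pairs.append((span["label"], text[start:end]))  # end is exclusive
        out.append(pairs)
    return out


print(labeled_spans(sample))
# [[('name', 'John Smith')], []]
```

The second line yields an empty list rather than being skipped, matching the negative-example behavior described above.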
The exported file is named <batch>-pheye-<timestamp>.jsonl so it is
distinguishable from a plain JSONL export in the same destination.