Latent Engineering · Research
Every model forces a trade between accuracy and cost. CIPHER does not. A 1.5B-parameter encoder that holds frontier-level recall at roughly one thousandth the inference cost, near 50 milliseconds per clinical document.
In plain terms: CIPHER, Latent’s de-identification model, takes the patient out of the note without changing the medicine. It matches the accuracy of the largest models, and it runs fast and cheap enough to protect every note a health system holds, not just the sample it has time to review by hand.
97.9%
Recall on 195 expert-annotated clinical notes
1000×
Inference cost reduction versus frontier LLM baselines
~50ms
Latency per document on a single H100
Clinical de-identification is a deceptively hard Named Entity Recognition (NER) problem. The task requires detecting 18 categories of entities, from obvious identifiers like social security numbers and phone numbers to subtle cases that confound naive classifiers: eponymous medical terms that share surface form with names ("Foley catheter," "Crohn's disease"), dates in non-standard formats ("POD#3," "two Tuesdays ago"), and place names that double as clinical entities ("Manchester triage score," "Glasgow coma scale"). Identifiers also appear in diverse contexts: a given name might be the patient, a family member mentioned in social history, a provider signing a note, or even an everyday adjective ("the max value is…"). The system must distinguish between PHI-like patterns and legitimate clinical terms: is "Dr. Smith" a provider name or a medication brand?
Existing approaches force a difficult tradeoff. When optimized for the specific domain and task, large language models can identify PHI accurately, but their cost, latency, and non-determinism make them impractical at scale. Rule-based systems achieve high precision on common patterns but struggle with the linguistic variance of real clinical documents; abbreviations, misspellings, and context-dependent identifiers frequently evade detection. We needed something that combines LLM-level accuracy with the speed, cost-efficiency, and determinism that production healthcare systems demand.
We built CIPHER (Clinical Identifier Protection via Hybrid Entity Redaction) to close that gap. CIPHER uses an ensemble of LLMs to generate probability distributions over BIO labels, then distills that knowledge into a 1.5B-parameter encoder-based token classifier, achieving ~98% recall on expert-labeled evaluation sets while reducing inference cost by 1000× and latency to ~50 ms per document. Our approach combines outlier-aware dataset construction that oversamples challenging cases, soft-label distillation that preserves ensemble uncertainty at ambiguous entity boundaries, and a vectorized CRF implementation for efficient structured prediction. The result is a production-ready system that delivers LLM-level accuracy with the determinism and efficiency that healthcare deployments demand.

~50 ms per document is fast enough to de-identify at the moment of writing, not in an overnight batch.
Transplant surgery · Daily progress note · Synthetic example
Admission: 2026-03-11
Chief complaint: Robert Tanaka is a 57 year old male with a history of ETOH cirrhosis, s/p orthotopic liver transplant on 2026-02-20. MRN 5509134 · callback (208) 555-0136. Seen at St. Luke's Medical Center, Boise. Stable since last infusion; PA turnaround expected ~3 days under current workflow.
Attending: Dr. Karen Lindgren
RN: Tomoko Sato, BSN
Figure 1. A de-identified clinical note (values rotated to fake data). Scroll to move the note from raw text, through token-level BIO labels, to redacted output. Identifiers appear in diverse contexts; the system disambiguates a provider name from clinical terms.
Our pipeline operates in four stages: 1) stratify and select a representative subsample of our full distribution using lexical and semantic features, 2) manually label this subset to produce an expert-labeled gold set of high-quality examples, 3) use an ensemble of open-source, high-reasoning LLMs to scale this with soft labels, 4) train a compact encoder on this scaled silver set to distill the identification behavior. The key insight is that we can use expensive, high-quality LLM annotations at training time while deploying a fast, deterministic encoder at inference time, capturing the best of both paradigms.
Figure 2 · High-level system architecture
Curating the training data
Table 1. Note features for outlier detection.
We construct training datasets from clinical notes sourced across our partner healthcare systems, ensuring our models learn from diverse documentation styles, patient populations, and clinical workflows. A random sample would over-represent high-volume note types (e.g., progress notes) while under-sampling rare but challenging categories (e.g., operative reports, discharge summaries). Rather than accepting this bias, we deliberately oversample challenging cases. We compute robust Z-scores (using median absolute deviation) across 14 lexical and semantic features including text length, lexical diversity, PHI density, formatting patterns, corpus similarity, and regex-matched identifier counts; the features are described in Table 1. We stratify across document-type and organization groups and flag notes with Z > 2.0 as outliers. The final dataset uses 90% stratified random sampling and 10% outlier oversampling, balancing representative coverage with sufficient exposure to edge cases. We then deploy a K=8 ensemble of OSS LLM annotators to produce soft labels across ~15,000 clinical notes, a labeling effort that cost ~$2,500 and a couple of orders of magnitude less wall-clock time than an expert human pass.

The teacher run in one line: K=8 open models, ~15,000 notes, ~$2,500. An expert human pass at this scale costs orders of magnitude more.
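The outlier step can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the 0.6745 scaling constant (the usual convention that makes MAD comparable to a standard deviation), the use of the absolute Z-score, and the rule that a note is an outlier if any single feature exceeds the threshold are all assumptions, and the two toy feature columns are hypothetical.

```python
import numpy as np

def robust_z_scores(features: np.ndarray) -> np.ndarray:
    """Per-feature robust Z-scores: 0.6745 * (x - median) / MAD.

    `features` has shape (n_notes, n_features). The 0.6745 constant is an
    assumed convention, not something stated in the article.
    """
    median = np.median(features, axis=0)
    mad = np.median(np.abs(features - median), axis=0)
    # A feature with zero MAD carries no spread; map it to Z = 0.
    mad = np.where(mad > 0, mad, np.inf)
    return 0.6745 * (features - median) / mad

def flag_outliers(features: np.ndarray, threshold: float = 2.0) -> np.ndarray:
    """Flag a note when any feature has |Z| above the threshold (assumed rule)."""
    z = robust_z_scores(features)
    return np.any(np.abs(z) > threshold, axis=1)

# Hypothetical features: [text length, lexical diversity] for five notes.
notes = np.array([
    [100.0, 0.50],
    [110.0, 0.52],
    [95.0, 0.48],
    [105.0, 0.51],
    [900.0, 0.50],  # unusually long note
])
flags = flag_outliers(notes)
```

Only the last note is flagged here; in a full pipeline the flagged notes would feed the 10% oversampling pool while the rest go through stratified random sampling.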
Rather than taking the ensemble's majority vote as a hard label, we preserve the full vote distribution as a soft label. For each token at position t, the soft label distribution is:

$$p_t(c) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left[v_t^{(k)} = c\right], \quad c \in \ell$$

where K is the number of ensemble members, $v_t^{(k)}$ is the label assigned to token t by the k-th annotator, 𝟙[·] is the indicator function, and ℓ is the set of BIO labels. When 6 out of 8 voters agree a token is B-NAME and 2 say O, the resulting distribution [0.75, 0.25, 0, ...] encodes genuine boundary uncertainty that hard labels would discard.
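As a small worked example of the vote-to-distribution step, the sketch below counts each annotator's label per token and divides by K. The three-label set and the function name `soft_labels` are hypothetical and chosen only to reproduce the 6-of-8 case from the text.

```python
import numpy as np

# Hypothetical label subset; the real label set covers all PHI categories.
LABELS = ["B-NAME", "O", "I-NAME"]

def soft_labels(votes: list[list[str]], labels: list[str]) -> np.ndarray:
    """Turn K annotators' label sequences into per-token vote distributions.

    `votes[k][t]` is the label annotator k assigned to token t.
    Returns an array of shape (T, len(labels)) whose rows sum to 1.
    """
    index = {label: i for i, label in enumerate(labels)}
    num_annotators = len(votes)
    num_tokens = len(votes[0])
    counts = np.zeros((num_tokens, len(labels)))
    for annotator_votes in votes:
        for t, label in enumerate(annotator_votes):
            counts[t, index[label]] += 1
    return counts / num_annotators

# One ambiguous token, 8 voters: 6 say B-NAME, 2 say O.
votes = [["B-NAME"]] * 6 + [["O"]] * 2
dist = soft_labels(votes, LABELS)  # [[0.75, 0.25, 0.0]]
```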
This is closely related to standard knowledge distillation[3], where a student trains on a teacher's softmax outputs rather than hard labels — but rather than temperature-scaling a single model's logits, our soft targets arise naturally from ensemble disagreement. Ensemble vote distributions are better calibrated[4], allow us to follow diverse reasoning trajectories, isolate genuine model disagreement from inherent ambiguity, and surface correlated failure modes that no single reasoning path can reveal.
Our student combines three parts: a Stella 1.5B encoder[9], a classification head, and a Conditional Random Field (CRF) layer. The encoder is a bidirectional transformer based on Qwen2-1.5B[8] and converted via the LLM2Vec approach[2], in which the causal attention mask is replaced with a fully bidirectional mask. The CRF models the joint probability of the entire label sequence rather than making independent per-token predictions:

$$P(y \mid x) = \frac{\exp s(x, y)}{\sum_{y'} \exp s(x, y')}$$

where the score function is:

$$s(x, y) = \sum_{t=1}^{T} E_{t, y_t} + \sum_{t=2}^{T} A_{y_{t-1}, y_t}$$

and $E \in \mathbb{R}^{T \times |\ell|}$ are the emission scores from the classification head, and $A \in \mathbb{R}^{|\ell| \times |\ell|}$ is the learned transition matrix where $A_{ij}$ represents the score for transitioning from label i to label j.
Unlike independent token classifiers, the CRF models the conditional probability of the entire label sequence, learning transition probabilities that capture dependencies between adjacent labels. This enforces structural constraints that per-token classification cannot — an I-NAME tag should only follow B-NAME or I-NAME, never appear after O or a different entity type. At inference, we introduce a vectorized batched Viterbi algorithm that processes all sequences in a batch simultaneously, reducing GPU-CPU synchronizations from O(B·T) to O(T) and achieving 20-50× speedup over naive sequential decoding.
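To make the batched decoding idea concrete, here is a minimal NumPy sketch of Viterbi that loops only over time steps and handles every sequence in the batch with array operations. It is a CPU stand-in under stated assumptions, not the GPU implementation described above: all sequences are assumed to have the same length (no padding mask), and the three-label set with a large negative score on the O to I transition is hypothetical.

```python
import numpy as np

def batched_viterbi(emissions: np.ndarray, transitions: np.ndarray) -> np.ndarray:
    """Decode the best label path for every sequence in a batch.

    emissions:   (B, T, L) per-token label scores.
    transitions: (L, L), transitions[i, j] = score of moving from label i to j.
    Returns an integer array of shape (B, T).
    """
    batch, steps, num_labels = emissions.shape
    score = emissions[:, 0, :]  # best score ending in each label at t = 0
    backptr = np.zeros((batch, steps, num_labels), dtype=np.int64)
    for t in range(1, steps):
        # cand[b, i, j]: best path ending in label i at t-1, then moving to j.
        cand = score[:, :, None] + transitions[None, :, :]
        backptr[:, t, :] = cand.argmax(axis=1)
        score = cand.max(axis=1) + emissions[:, t, :]
    # Backtrack for all sequences at once.
    paths = np.zeros((batch, steps), dtype=np.int64)
    paths[:, -1] = score.argmax(axis=1)
    rows = np.arange(batch)
    for t in range(steps - 1, 0, -1):
        paths[:, t - 1] = backptr[rows, t, paths[:, t]]
    return paths

# Hypothetical labels: 0 = O, 1 = B-NAME, 2 = I-NAME; O -> I-NAME is penalized.
A = np.zeros((3, 3))
A[0, 2] = -1e4
E = np.array([
    [[5.0, 0.0, 0.0], [0.0, 1.0, 2.0], [0.0, 0.0, 3.0]],  # argmax per token: O, I, I
    [[3.0, 0.0, 0.0], [3.0, 0.0, 0.0], [3.0, 0.0, 0.0]],  # all O
])
decoded = batched_viterbi(E, A)
```

For the first sequence, independent per-token argmax would emit the invalid O, I-NAME, I-NAME; the decoder instead returns O, B-NAME, I-NAME because the O to I-NAME transition is penalized. The only Python-level loops run over time steps, which is the property that keeps host-device synchronization proportional to T on a GPU.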
Figure 3. Encoder + CRF architecture. Input text is tokenized and embedded, encoded by a bidirectional Stella 1.5B, projected through a classification head (Linear 1536→1536 · GELU · Dropout · Linear 1536→39) to per-token emission scores E, then decoded by a CRF layer with a learned transition matrix A via a vectorized Viterbi pass into BIO predictions y.
Interactive figure · Transition matrix A · what the CRF allows
The matrix encodes structural constraints: I-NAME can only follow B-NAME or I-NAME; it can never follow O.
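The BIO rule stated above can be written as a small validity check. This sketch is illustrative only: the source describes a learned transition matrix, so building an explicit boolean mask is an assumption about how such constraints could be inspected or enforced, and the five-label set is hypothetical.

```python
import numpy as np

# Hypothetical label subset with two entity types.
LABELS = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE"]

def is_allowed(prev_label: str, next_label: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X; everything else is allowed."""
    if next_label.startswith("I-"):
        entity = next_label[2:]
        return prev_label in (f"B-{entity}", f"I-{entity}")
    return True

def allowed_transition_mask(labels: list[str]) -> np.ndarray:
    """mask[i, j] is True when label j may follow label i."""
    return np.array([[is_allowed(p, n) for n in labels] for p in labels])

mask = allowed_transition_mask(LABELS)
```

One way such a mask could be used is to add a large negative constant to the disallowed cells of the transition matrix before decoding, which guarantees invalid sequences are never selected.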
Training objective
Training combines two complementary losses: a distillation term and a CRF term.
The distillation term is a focal-modulated soft cross-entropy that addresses the severe class imbalance between PHI and non-PHI tokens. The vast majority of tokens are non-PHI, and unmodulated cross-entropy would let the model minimize loss by confidently predicting "O" everywhere. We apply focal modulation[5], which scales each token's loss term by $(1 - \hat{p})^{\gamma}$.
Interactive figure · Focal modulation
As γ increases, the modulating factor $(1 - \hat{p})^{\gamma}$ down-weights confident predictions, concentrating gradient signal on ambiguous tokens.
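A short sketch of how focal modulation interacts with soft targets is given below. The exact form of the loss is not shown in the text, so this assumes one plausible variant in which each class term of the soft cross-entropy is scaled by $(1 - \hat{p}_c)^{\gamma}$; the function names are hypothetical.

```python
import numpy as np

def modulating_factor(p_hat: float, gamma: float) -> float:
    """Focal down-weighting applied to a prediction made with probability p_hat."""
    return (1.0 - p_hat) ** gamma

def focal_soft_cross_entropy(logits: np.ndarray, soft_targets: np.ndarray,
                             gamma: float = 2.0) -> float:
    """Mean over tokens of -sum_c q_c * (1 - p_c)^gamma * log p_c (assumed form).

    logits, soft_targets: arrays of shape (T, num_labels).
    With gamma = 0 this reduces to ordinary soft cross-entropy.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_p = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    p = np.exp(log_p)
    per_token = -(soft_targets * (1.0 - p) ** gamma * log_p).sum(axis=-1)
    return float(per_token.mean())

confident = modulating_factor(0.95, gamma=2.0)  # 0.05 ** 2
uncertain = modulating_factor(0.50, gamma=2.0)  # 0.5 ** 2

# One token, two labels, uniform prediction, soft target from a 6-vs-2 vote.
logits = np.zeros((1, 2))
targets = np.array([[0.75, 0.25]])
plain = focal_soft_cross_entropy(logits, targets, gamma=0.0)
focal = focal_soft_cross_entropy(logits, targets, gamma=2.0)
```

With a uniform two-way prediction, the plain loss equals log 2 and the focal loss is one quarter of it, since every class term is scaled by 0.5².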
With γ=2.0, a token classified with 95% confidence has a modulating factor of (1 − 0.95)² = 0.0025, so it contributes 400× less to the loss than it would under unmodulated cross-entropy, while a token at 50% confidence is down-weighted only 4×. This forces the model to spend its capacity on ambiguous tokens near entity boundaries rather than easy non-PHI tokens in the middle of sentences. The CRF loss operates on hard labels derived from the soft distribution (ŷ = arg max p) and penalizes structurally invalid sequences:

$$\mathcal{L}_{\text{CRF}} = -\log \frac{\exp s(x, \hat{y})}{\sum_{y'} \exp s(x, y')}$$

where s is the CRF score function, the numerator scores the correct label sequence, and the denominator marginalizes over all possible sequences via the forward algorithm. Together, the focal distillation loss trains the model to classify individual tokens accurately while the CRF loss ensures coherent entity sequences.
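The CRF loss described above, a sequence score minus a log-partition computed with the forward algorithm, can be sketched for a single sequence as follows. This is a minimal NumPy illustration under the standard linear-chain formulation (emission scores plus pairwise transition scores); the random inputs are placeholders, and the brute-force sum at the end is only a sanity check that the probabilities over all label sequences add up to one.

```python
import itertools
import numpy as np

def logsumexp(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sequence_score(E: np.ndarray, A: np.ndarray, y: np.ndarray) -> float:
    """s(x, y): emission scores along the path plus transition scores."""
    emit = E[np.arange(len(y)), y].sum()
    trans = A[y[:-1], y[1:]].sum()
    return float(emit + trans)

def log_partition(E: np.ndarray, A: np.ndarray) -> float:
    """Forward algorithm: log of the summed exp-scores of all label sequences."""
    alpha = E[0]
    for t in range(1, len(E)):
        # alpha[i] + A[i, j], reduced over the previous label i.
        alpha = logsumexp(alpha[:, None] + A, axis=0) + E[t]
    return float(logsumexp(alpha, axis=0))

def crf_nll(E: np.ndarray, A: np.ndarray, y: np.ndarray) -> float:
    """Negative log-likelihood of the label sequence y."""
    return log_partition(E, A) - sequence_score(E, A, y)

# Placeholder scores: 3 tokens, 2 labels.
rng = np.random.default_rng(0)
E = rng.normal(size=(3, 2))
A = rng.normal(size=(2, 2))
total_prob = sum(
    np.exp(-crf_nll(E, A, np.array(y)))
    for y in itertools.product(range(2), repeat=3)
)
```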
CIPHER matches LLM-level accuracy at a fraction of the cost. On our held-out set of 195 expert-annotated clinical notes, CIPHER achieves 97.9% recall and 93.85% F1, which exceeds its teacher LLM (GPT-OSS-120B) on recall (97.9% vs 97.5%) and trails the best frontier model by only 0.2%. The student surpassing its teacher is a known property of distillation: the ensemble's soft labels encode richer signal than any individual member provides. Critically, inference drops from seconds per document to ~50 ms, with a 1000× cost reduction that makes processing these documents tractable.
Figure 4. CIPHER vs. decoder-based approaches: de-identification recall vs. cost per note for CIPHER and frontier LLM baselines on 195 expert-annotated clinical notes. Frontier models (gpt-5.4-nano, gpt-5.4-mini, gpt-5.5) are shown with single-shot (k=1, filled) and 5-voter ensemble (k=5; majority vote, hollow) configurations. GPT-OSS-120B (CIPHER's teacher) is shown for reference. Cost is log-scaled, computed from per-token pricing (LLMs) or H100 compute time (CIPHER).
Soft labels are the single most impactful component.
| Configuration | Recall | Precision | F1 |
|---|---|---|---|
| Full system | 97.9% | 89.1% | 93.85% |
| w/o CRF | 96.65% | 89.04% | 92.69% |
| w/o focal loss | 96.96% | 90.53% | 93.64% |
| w/o soft labels | 97.43% | 84.31% | 90.40% |
Table 2. Architecture ablation.
Removing soft labels and training on hard argmax labels instead drops F1 by 3.45 points. Precision falls by about 5 points as the model, lacking information about inter-class similarity and uncertainty, treats every teacher prediction with equal confidence and aggressively labels borderline tokens as PHI. The CRF contributes +1.16 F1 points through structural consistency, primarily benefiting multi-token entities like addresses and full names. Focal loss provides a modest recall boost at a small precision cost, consistent with its design intent of pushing the model toward catching more PHI. The lesson: capturing teacher uncertainty matters more than any individual architectural choice.
Scaling behavior is log-linear with diminishing returns.
Figure 5. Scaling behavior. Recall, precision, and F1 as a function of training set size. The dashed line marks the ~98% recall ceiling where returns diminish.
Recall scales from 86.7% at 500 notes to 97.7% at 15,000, with most of the relative gain coming early: the first 2,000 notes improve recall by ~8 points. This suggests ~5,000 notes capture most of the recall ceiling, making our choice of 15,000 conservative but thorough.
Real data beats synthetic data — decisively. We evaluated mixtures of synthetic clinical notes (Nemotron-PII[6]) and authentic notes, holding total dataset size fixed at 10,000.
| Synthetic % | Authentic % | Auth. Recall | Auth. F1 | Synth. Recall | Synth. F1 |
|---|---|---|---|---|---|
| 100% | 0% | 78.2% | 76.5% | 96.8% | 95.1% |
| 75% | 25% | 85.6% | 83.9% | 94.1% | 92.7% |
| 50% | 50% | 91.3% | 89.8% | 94.5% | 92.0% |
| 25% | 75% | 96.1% | 93.2% | 97.2% | 93.8% |
| 0% | 100% | 97.4% | 93.6% | 95.9% | 94.3% |
Table 3. Performance by data mixture.
A model trained entirely on synthetic data achieves 96.8% recall on synthetic test data but only 78.2% on authentic clinical notes, a 19-point gap. Each 25% shift toward authentic data closes this gap steeply, with authentic eval recall climbing from 78.2% to 97.4%. The model trained on 100% authentic data achieves the highest recall on authentic notes and stays within a point of the synthetic-only model on the synthetic test set (95.9% vs 96.8%), while the 25% synthetic / 75% authentic mixture outperforms even the synthetic-only model on its own test set (97.2% vs 96.8%). The takeaway for clinical NLP broadly: synthetic data can demonstrate that a method works in principle, but the distribution mismatch between generated and authentic text translates directly into missed PHI at inference time.
De-identification must preserve clinical meaning.
A critical concern with any de-identification system is whether it preserves the clinical substance of the original text. Removing identifiers provides little value if the process also strips away the context and nuance that make the original document informative.
| Metric | Value |
|---|---|
| Paired cosine similarity | 0.985 |
| KNN recall (k=10) | 0.956 |
| Wasserstein distance | 0.064 |
Table 4. Preservation of clinical semantics.
We embed each note before and after de-identification using Clinical ModernBERT[7] and measure semantic preservation along three axes. Paired cosine similarity is 0.985, meaning a note's embedding before and after de-identification differs by a cosine distance of only about 0.015 on average. KNN recall (k=10) is 0.956, meaning 95.6% of neighbor relationships survive de-identification. Wasserstein distance is 0.064, well below the 0.1 threshold, indicating no systematic distributional distortion. These results validate a core advantage of our token-level approach over generative de-identification methods[1]: because we identify and mask specific character spans without touching surrounding text, we cannot round a lab value, compress a symptom timeline, or drop a negation. Clinical meaning is preserved by construction.
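Two of the three metrics are simple to sketch once embeddings exist. The code below is a hedged illustration, not the evaluation code: the embeddings are random placeholders standing in for Clinical ModernBERT outputs, and KNN recall is assumed to mean the average fraction of each note's k nearest neighbors (by cosine similarity) before de-identification that remain among its k nearest neighbors afterwards. The Wasserstein metric is omitted because its exact definition over embeddings is not given in the text.

```python
import numpy as np

def _unit_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def paired_cosine_similarity(before: np.ndarray, after: np.ndarray) -> float:
    """Mean cosine similarity between each note's two embeddings."""
    return float((_unit_rows(before) * _unit_rows(after)).sum(axis=1).mean())

def knn_indices(x: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbors by cosine similarity."""
    sims = _unit_rows(x) @ _unit_rows(x).T
    np.fill_diagonal(sims, -np.inf)  # a note is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def knn_recall(before: np.ndarray, after: np.ndarray, k: int = 10) -> float:
    """Average share of pre-de-identification neighbors kept afterwards (assumed definition)."""
    nb, na = knn_indices(before, k), knn_indices(after, k)
    overlap = [len(set(b) & set(a)) / k for b, a in zip(nb, na)]
    return float(np.mean(overlap))

# Placeholder embeddings: 50 notes, 16 dimensions, with a small perturbation.
rng = np.random.default_rng(0)
emb_before = rng.normal(size=(50, 16))
emb_after = emb_before + 0.01 * rng.normal(size=(50, 16))
cos = paired_cosine_similarity(emb_before, emb_after)
recall = knn_recall(emb_before, emb_after, k=10)
```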
CIPHER demonstrates that knowledge distillation can transfer frontier-level PHI detection to a 1.5B-parameter encoder at 1000× lower inference cost, achieving 97.9% recall at ~50 ms per document.
Our ablations show that the choice of training signal matters more than architecture. Soft labels contributed the largest improvement (+3.45 F1 points), encoding boundary uncertainty and inter-class similarity that hard labels discard. The CRF adds structural consistency (+1.16 F1 points) by enforcing valid BIO transitions, and focal loss shifts gradient signal toward rare PHI tokens at a modest precision cost. The synthetic data experiments reinforce that models need to train on the distribution they will serve: the 19-point recall gap between synthetic-only and authentic-only training is a statement about distributional fidelity, not data quantity.
Our semantic preservation metrics (0.985 cosine similarity, 0.956 KNN recall) confirm that token-level de-identification maintains clinical meaning by construction. CIPHER is one step in our broader mission of delivering personalized, provider-grade care for every patient.