v1.0 · 2026-05-07

Latent Engineering · Research

Pushing the Pareto Frontier for Clinical Deidentification.

Every model forces a trade between accuracy and cost. CIPHER does not. A 1.5B-parameter encoder that holds frontier-level recall at roughly one thousandth the inference cost, near 50 milliseconds per clinical document.

In plain terms: CIPHER, Latent’s de-identification model, takes the patient out of the note without changing the medicine. It matches the accuracy of the largest models, and it runs fast and cheap enough to protect every note a health system holds, not just the sample it has time to review by hand.


The cost / accuracy frontier. CIPHER sits where nothing else does.

Manuscript: M–08–26
Section: Engineering · Research
Subject: Privacy infrastructure
Status: Published · v1.0
Filed: May 7, 2026

Authors: Liam Sullivan, Dilip Thiagarajan, Allan Bishop
Read time: ~12 min
Footnotes: 9

97.9%

Recall on 195 expert-annotated clinical notes

1000×

Inference cost reduction versus frontier LLM baselines

~50ms

Latency per document on a single H100

§ 01 · Context

The Scale of the Problem.

Clinical de-identification is a deceptively hard Named Entity Recognition (NER) problem. The task requires detecting 18 categories of entities, from obvious identifiers like social security numbers and phone numbers to subtle cases that confound naive classifiers: eponymous medical terms that share surface form with names ("Foley catheter," "Crohn's disease"), dates in non-standard formats ("POD#3," "two Tuesdays ago"), and place names that double as clinical entities ("Manchester triage score," "Glasgow coma scale"). Identifiers also appear in diverse contexts: a given name might be the patient, a family member mentioned in social history, a provider signing a note, or even an everyday adjective ("the max value is…"). The system must distinguish between PHI-like patterns and legitimate clinical terms: is "Dr. Smith" a provider name or a medication brand?

Existing approaches force a difficult tradeoff. When optimized for the specific domain and task, large language models can identify PHI accurately, but their cost, latency, and non-determinism make them impractical at scale. Rule-based systems achieve high precision on common patterns but struggle with the linguistic variance of real clinical documents; abbreviations, misspellings, and context-dependent identifiers frequently evade detection. We needed something that combines LLM-level accuracy with the speed, cost-efficiency, and determinism that production healthcare systems demand.

We built CIPHER (Clinical Identifier Protection via Hybrid Entity Redaction) to close that gap. CIPHER uses an ensemble of LLMs to generate probability distributions over BIO labels, then distills that knowledge into a 1.5B-parameter encoder-based token classifier, achieving 97.9% recall on our expert-labeled evaluation set while reducing inference cost by 1000× and latency to ~50 ms per document. Our approach combines three pieces: outlier-aware dataset construction that oversamples challenging cases, soft-label distillation that preserves ensemble uncertainty at ambiguous entity boundaries, and a vectorized CRF implementation for efficient structured prediction. The result is a production-ready system that delivers LLM-level accuracy with the determinism and efficiency that healthcare deployments demand. At ~50 ms per document, it is fast enough to de-identify at the moment of writing, not in an overnight batch.


From raw note to structured BIO labels to redacted output, in three scrolls.

Transplant surgery · Daily progress note · Synthetic example
Admission: 2026-03-11

Chief complaint: Robert Tanaka is a 57 year old male with a history of ETOH cirrhosis, s/p orthotopic liver transplant on 2026-02-20.
MRN 5509134 · callback (208) 555-0136.

Seen at St. Luke's Medical Center, Boise. Stable since last infusion; PA turnaround expected ~3 days under current workflow.

Attending: Dr. Karen Lindgren
RN: Tomoko Sato, BSN

Legend: Name · Date · Age · MRN · Phone · Org · Loc

Figure 1. A de-identified clinical note (values rotated to fake data). Scroll to move the note from raw text, through token-level BIO labels, to redacted output. Identifiers appear in diverse contexts; the system disambiguates a provider name from clinical terms.
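The last step in Figure 1, going from token-level BIO labels to redacted text, can be sketched in a few lines. This is a minimal illustration, not CIPHER's actual post-processing: the function names, the `[NAME]`-style placeholder format, and the choice to treat a stray `I-` tag as the start of a new span are all assumptions made for the example.

```python
def bio_to_spans(labels, offsets):
    """Merge per-token BIO tags into (start, end, entity_type) character spans.

    labels:  one BIO tag per token, e.g. "O", "B-NAME", "I-NAME"
    offsets: one (start, end) character offset pair per token
    """
    spans, current = [], None
    for tag, (start, end) in zip(labels, offsets):
        # An I-X tag extends the open span only if the entity type matches.
        if tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end
            continue
        if current:
            spans.append(tuple(current))
            current = None
        if tag != "O":
            # B-X opens a span; a stray I-X is treated as a new span (assumption).
            current = [start, end, tag[2:]]
    if current:
        spans.append(tuple(current))
    return spans


def redact(text, spans):
    """Replace each span with a placeholder and leave all other characters untouched."""
    out, cursor = [], 0
    for start, end, entity_type in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{entity_type}]")  # hypothetical placeholder format
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Attending: Dr. Karen Lindgren"
offsets = [(0, 10), (11, 14), (15, 20), (21, 29)]
labels = ["O", "O", "B-NAME", "I-NAME"]
print(redact(text, bio_to_spans(labels, offsets)))  # Attending: Dr. [NAME]
```

Because only the labeled character spans are replaced, everything outside them is copied through byte for byte.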

§ 02 · Method

Teaching a Small Model to See PHI.

Our pipeline operates in four stages: 1) stratify and select a representative subsample of our full distribution using lexical and semantic features, 2) manually label this subset to produce an expert-labeled gold set of high-quality examples, 3) use an ensemble of open-source, high-reasoning LLMs to scale this with soft labels, 4) train a compact encoder on this scaled silver set to distill the identification behavior. The key insight is that we can use expensive, high-quality LLM annotations at training time while deploying a fast, deterministic encoder at inference time, capturing the best of both paradigms.

Figure 2. High-level system architecture.

01 Dataset construction: outlier-aware sampling · 14 features · Z > 2.0 oversampled → curated notes

02 LLM ensemble annotation: soft labels from voting · K = 8 · ~15,000 notes → annotated set

03 Student training: Encoder + CRF · Stella 1.5B · ~50 ms / doc

Curating the training data

Feature	Description
text_length	Character count of the note
unique_tokens	Number of unique tokens (tiktoken cl100k_base)
type_token_ratio	Lexical diversity: unique words / total words
compression_ratio	gzip compressed size / raw size (lower = more repetitive)
digit_share	Fraction of characters that are digits
longest_numeric_run	Length of longest consecutive digit sequence
non_letter_symbol_pct	Percentage of non-alphabetic characters
mean_corpus_similarity	Mean TF-IDF cosine similarity to other notes
regex_counts__date	Count of date patterns (MM/DD/YYYY, YYYY-MM-DD, etc.)
regex_counts__phone	Count of phone patterns (XXX-XXX-XXXX)
regex_counts__ssn	Count of SSN patterns (XXX-XX-XXXX)
regex_counts__email	Count of email patterns
regex_counts__mrn	Count of MRN patterns ("MRN: 12345")
regex_counts__address	Count of street address patterns

Table 1. Note features for outlier detection.
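To make the table concrete, here is a rough sketch of how a subset of these per-note features could be computed with the Python standard library. The regular expressions are illustrative guesses, since the post names the pattern families but not the exact expressions, and the corpus-level and tokenizer-based features (`mean_corpus_similarity`, `unique_tokens`) are left out.

```python
import gzip
import re

# Hypothetical patterns; the post does not publish the exact regexes.
DATE_RE = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def note_features(text: str) -> dict:
    """Compute a subset of the Table 1 features for a single note."""
    words = text.split()
    raw = text.encode("utf-8")
    digit_runs = re.findall(r"\d+", text)
    n_chars = max(len(text), 1)  # guard against empty notes
    return {
        "text_length": len(text),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "compression_ratio": len(gzip.compress(raw)) / max(len(raw), 1),
        "digit_share": sum(c.isdigit() for c in text) / n_chars,
        "longest_numeric_run": max((len(run) for run in digit_runs), default=0),
        # Assumption: whitespace and digits count as non-alphabetic characters.
        "non_letter_symbol_pct": 100.0 * sum(not c.isalpha() for c in text) / n_chars,
        "regex_counts__date": len(DATE_RE.findall(text)),
        "regex_counts__phone": len(PHONE_RE.findall(text)),
        "regex_counts__ssn": len(SSN_RE.findall(text)),
    }


features = note_features("Seen 2026-03-11. Callback 208-555-0136.")
```

Very short notes can have a `compression_ratio` above 1 because of gzip header overhead, which is one reason a robust, median-based outlier score is a sensible choice downstream.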

We construct training datasets from clinical notes sourced across our partner healthcare systems, ensuring our models learn from diverse documentation styles, patient populations, and clinical workflows. A random sample would over-represent high-volume note types (e.g., progress notes) while under-sampling rare but challenging categories (e.g., operative reports, discharge summaries). Rather than accepting this bias, we deliberately oversample challenging cases. We compute robust Z-scores (using median absolute deviation) across 14 lexical and semantic features including text length, lexical diversity, PHI density, formatting patterns, corpus similarity, and regex-matched identifier counts. Descriptions of these features are included in Table 1. We also stratify across document-type and organization groups, flagging notes with Z > 2.0 as outliers. The final dataset uses 90% stratified random sampling and 10% outlier oversampling, balancing representative coverage with sufficient exposure to edge cases. We then deploy a K=8 ensemble of OSS LLM annotators to produce soft labels across ~15,000 clinical notes, a labeling effort that cost ~$2,500 and took a couple of orders of magnitude less wall-clock time than an expert human pass, which at this scale would also cost orders of magnitude more.
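The outlier-aware sampling step can be sketched roughly as follows. Several details here are assumptions rather than the post's implementation: the 0.6745 normal-consistency constant, flagging a note when any feature exceeds the threshold in absolute value, and using plain random sampling where the post uses stratified sampling across document-type and organization groups.

```python
import numpy as np


def robust_z(features: np.ndarray) -> np.ndarray:
    """Per-feature robust Z-scores for an (n_notes, n_features) matrix."""
    median = np.median(features, axis=0)
    mad = np.median(np.abs(features - median), axis=0)
    # 0.6745 rescales MAD so scores are comparable to standard Z-scores
    # under normality; constant features fall back to a unit scale (assumption).
    scale = np.where(mad > 0, mad / 0.6745, 1.0)
    return (features - median) / scale


def flag_outliers(features: np.ndarray, threshold: float = 2.0) -> np.ndarray:
    """Assumption: a note is an outlier if any feature has |Z| above the threshold."""
    return (np.abs(robust_z(features)) > threshold).any(axis=1)


def build_sample(features, n_total, outlier_frac=0.10, seed=0):
    """Pick n_total note indices: ~10% from flagged outliers, the rest at random.

    The random portion is unstratified here for brevity.
    """
    rng = np.random.default_rng(seed)
    is_outlier = flag_outliers(features)
    outlier_idx = np.flatnonzero(is_outlier)
    n_out = min(int(round(n_total * outlier_frac)), outlier_idx.size)
    chosen_out = rng.choice(outlier_idx, size=n_out, replace=False)
    remaining = np.setdiff1d(np.arange(len(features)), chosen_out)
    chosen_rand = rng.choice(remaining, size=n_total - n_out, replace=False)
    return np.concatenate([chosen_rand, chosen_out]), is_outlier
```

The median and MAD are used instead of the mean and standard deviation so that the extreme notes being searched for do not inflate the scale they are measured against.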

The Teacher: a recall-optimized LLM ensemble.

Rather than taking the ensemble's majority vote as a hard label, we preserve the full vote distribution as a soft label. For each token at position t, the soft label distribution is:

$$p_t(\ell) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left[v_t^{(k)} = \ell\right] \tag{1}$$

where K is the number of ensemble members, v_t^(k) is the label assigned to token t by the k-th annotator, 𝟙[·] is the indicator function, and ℓ ranges over the set of BIO labels. When 6 out of 8 voters agree a token is B-NAME and 2 say O, the resulting distribution [0.75, 0.25, 0, ...] (ordered B-NAME, O, then the remaining labels) encodes genuine boundary uncertainty that hard labels would discard.
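As a small worked example of Equation 1, the snippet below turns a K × T matrix of votes into per-token soft labels and reports the entropy of the 6-versus-2 split described above. The three-label toy label set and the function names are illustrative only.

```python
import numpy as np

LABELS = ["B-NAME", "O", "I-NAME"]  # toy label set for illustration


def soft_labels(votes: np.ndarray, n_labels: int) -> np.ndarray:
    """votes: (K, T) integer label ids. Returns a (T, n_labels) vote distribution."""
    one_hot = np.eye(n_labels)[votes]  # (K, T, n_labels)
    return one_hot.mean(axis=0)        # average the indicator over the K voters


def entropy_bits(p: np.ndarray) -> float:
    """Shannon entropy in bits, skipping zero-probability labels."""
    nonzero = p[p > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())


# K = 8 voters, T = 1 token: six vote B-NAME (id 0), two vote O (id 1).
votes = np.array([[0], [0], [0], [0], [0], [0], [1], [1]])
p = soft_labels(votes, len(LABELS))
print(p[0], round(entropy_bits(p[0]), 3))  # 0.75 / 0.25 / 0 split, entropy 0.811
```

A hard majority vote would keep only the argmax (B-NAME) and discard the 0.25 mass on O.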

Worked example (8 voters, 1 ambiguous token): B-NAME 0.750 · O 0.250 · all other labels 0.000 · entropy 0.811 bits · hard label (argmax) B-NAME · confidence 75.0%.

This is closely related to standard knowledge distillation[3], where a student trains on a teacher's softmax outputs rather than hard labels — but rather than temperature-scaling a single model's logits, our soft targets arise naturally from ensemble disagreement. Ensemble vote distributions are better calibrated[4], allow us to follow diverse reasoning trajectories, isolate genuine model disagreement from inherent ambiguity, and surface correlated failure modes that no single reasoning path can reveal.

The Student: Encoder + CRF.

Our student combines a Stella 1.5B encoder[9] with a classification head and a Conditional Random Field (CRF) layer. The encoder is a bidirectional transformer based on Qwen2-1.5B[8] and converted via the LLM2Vec approach[2], in which the causal attention mask is replaced with a fully bidirectional mask. The CRF models the joint probability of the entire label sequence rather than making independent per-token predictions:

$$P(y \mid x) = \frac{\exp\left(s(x, y)\right)}{\sum_{y' \in Y} \exp\left(s(x, y')\right)} \tag{2}$$

where the score function is:

$$s(x, y) = \sum_{t=1}^{T} E[t, y_t] + \sum_{t=2}^{T} A[y_{t-1}, y_t] \tag{3}$$

and E ∈ ℝ^{T × |ℓ|} are the emission scores from the classification head, and A ∈ ℝ^{|ℓ| × |ℓ|} is the learned transition matrix where A_ij represents the score for transitioning from PHI label i to label j.
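For readers who want to see Equations 2 and 3 in executable form, the following NumPy sketch scores one label sequence and computes the log of the denominator with the forward algorithm in log space. It is a single-sequence illustration with hypothetical function names, not the batched production code.

```python
import itertools

import numpy as np


def sequence_score(E: np.ndarray, A: np.ndarray, y: np.ndarray) -> float:
    """s(x, y): emission scores along the path plus transition scores between neighbors."""
    emit = E[np.arange(len(y)), y].sum()
    trans = A[y[:-1], y[1:]].sum()
    return float(emit + trans)


def log_partition(E: np.ndarray, A: np.ndarray) -> float:
    """log of the sum over all label sequences of exp(s(x, y')), via the forward algorithm."""
    alpha = E[0]                                   # (L,) log-scores of length-1 prefixes
    for t in range(1, len(E)):
        scores = alpha[:, None] + A                # (L_prev, L_next)
        m = scores.max(axis=0)                     # stabilize the log-sum-exp
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + E[t]
    m = alpha.max()
    return float(m + np.log(np.exp(alpha - m).sum()))


def log_prob(E, A, y) -> float:
    """log P(y | x); its negation is the CRF negative log-likelihood."""
    return sequence_score(E, A, np.asarray(y)) - log_partition(E, A)


# Sanity check on a tiny problem: probabilities over all 3^4 sequences sum to 1.
rng = np.random.default_rng(0)
E, A = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
total = sum(np.exp(log_prob(E, A, y)) for y in itertools.product(range(3), repeat=4))
```

The forward recursion costs O(T·L²) instead of enumerating all L^T sequences, which is what makes the normalizer tractable for a 39-label tag set.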

Unlike independent token classifiers, the CRF models the conditional probability of the entire label sequence, learning transition probabilities that capture dependencies between adjacent labels. This enforces structural constraints that per-token classification cannot — an I-NAME tag should only follow B-NAME or I-NAME, never appear after O or a different entity type. At inference, we introduce a vectorized batched Viterbi algorithm that processes all sequences in a batch simultaneously, reducing GPU-CPU synchronizations from O(B·T) to O(T) and achieving 20-50× speedup over naive sequential decoding.
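A batched Viterbi decode of the kind described here might look like the NumPy sketch below: each time step is one vectorized operation over the whole batch, so the Python loop runs T times rather than B·T times. This is an illustration under simplifying assumptions (equal-length sequences with no padding mask, no start-of-sequence constraint), and the hard constraint mask is a teaching device; the post describes A as a learned transition matrix.

```python
import numpy as np

NEG = -1e4  # large negative score standing in for a forbidden transition


def bio_transition_mask(labels):
    """mask[i, j] is True when label j may follow label i under BIO rules."""
    n = len(labels)
    mask = np.ones((n, n), dtype=bool)
    for i, prev in enumerate(labels):
        for j, nxt in enumerate(labels):
            if nxt.startswith("I-"):
                ent = nxt[2:]
                # I-X may only follow B-X or I-X of the same entity type.
                mask[i, j] = prev in (f"B-{ent}", f"I-{ent}")
    return mask


def batched_viterbi(E: np.ndarray, A: np.ndarray) -> np.ndarray:
    """E: (B, T, L) emissions, A: (L, L) transitions. Returns (B, T) best label ids."""
    B, T, L = E.shape
    score = E[:, 0]                                  # (B, L) best score ending in each label
    backptr = np.zeros((B, T, L), dtype=np.int64)
    for t in range(1, T):
        cand = score[:, :, None] + A[None]           # (B, L_prev, L_next)
        backptr[:, t] = cand.argmax(axis=1)          # best previous label per next label
        score = cand.max(axis=1) + E[:, t]
    paths = np.zeros((B, T), dtype=np.int64)
    paths[:, -1] = score.argmax(axis=1)
    for t in range(T - 1, 0, -1):                    # follow back-pointers for the whole batch
        paths[:, t - 1] = backptr[np.arange(B), t, paths[:, t]]
    return paths


labels = ["O", "B-NAME", "I-NAME"]
A = np.where(bio_transition_mask(labels), 0.0, NEG)
E = np.array([
    [[2.0, 1.0, 0.0], [0.0, 0.0, 3.0], [0.0, 0.0, 3.0]],  # per-token argmax would give O, I, I
    [[3.0, 0.0, 0.0], [3.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
])
paths = batched_viterbi(E, A)
```

In the first sequence, independent per-token argmax would produce the invalid O → I-NAME transition; the joint decode returns B-NAME, I-NAME, I-NAME instead.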

Figure 3 diagram (text rendering): tokens x (Pt : Robert M. Tanaka , DOB 02/20/26) → Encoder (Stella 1.5B) → hidden states h₁…h₈ → classification head (Linear 1536→1536 · GELU · Dropout · Linear 1536→39) → emissions E (.04 .02 .81 .74 .78 .05 .71 .84) → CRF layer + transition matrix A ∈ ℝ^{|ℓ|×|ℓ|}, vectorized Viterbi decode (I-NAME may follow B-NAME or I-NAME; never follows O) → BIO labels y (O, B-NAME, I-NAME, O, B-DATE, I-DATE).

Figure 3. Encoder + CRF architecture. Input text is tokenized and embedded, encoded by a bidirectional Stella 1.5B, projected through a classification head (Linear 1536→1536 · GELU · Dropout · Linear 1536→39) to per-token emission scores E, then decoded by a CRF layer with a learned transition matrix A via a vectorized Viterbi pass into BIO predictions y.

Transition matrix A · what the CRF allows. Rows are previous labels, columns are next labels, and each cell marks a transition as valid or invalid. The matrix encodes structural constraints: I-NAME can only follow B-NAME or I-NAME; it can never follow O.

Training objective

Training combines two complementary losses:

$$\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{distill}} + (1 - \alpha) \cdot \mathcal{L}_{\text{CRF}} \tag{4}$$

The distillation term is a focal-modulated soft cross-entropy that addresses the severe class imbalance between PHI and non-PHI tokens. The vast majority of tokens are non-PHI, and unmodulated cross-entropy would let the model minimize loss by confidently predicting "O" everywhere. We apply focal modulation[5]:

$$\mathcal{L}_{\text{distill}} = -\sum_{t} (1 - \hat{p}_t)^{\gamma} \cdot \sum_{\ell} p_t(\ell) \log \hat{p}_t(\ell) \tag{5}$$

Focal modulation in motion. As γ increases, the modulating factor (1 − p̂)^γ down-weights confident predictions, concentrating gradient signal on ambiguous tokens. At γ = 2.0, a confident token (p̂ = 0.95) has a modulating factor of 0.05² = 0.0025, a 400× reduction relative to unmodulated cross-entropy (γ = 0), while an uncertain token (p̂ = 0.50) keeps a factor of 0.25.

With γ=2.0, a token classified with 95% confidence has its loss scaled by (1 − 0.95)² = 0.0025, a 400× reduction relative to unmodulated cross-entropy, while a token at 50% confidence is scaled by only 0.25 (a 4× reduction). This forces the model to spend its capacity on ambiguous tokens near entity boundaries rather than easy non-PHI tokens in the middle of sentences. The CRF loss operates on hard labels derived from the soft distribution (ŷ = arg max p) and penalizes structurally invalid sequences:

$$\mathcal{L}_{\text{CRF}} = -\log \frac{\exp\left(s(x, \hat{y})\right)}{\sum_{y'} \exp\left(s(x, y')\right)} \tag{6}$$

where the numerator scores the correct label sequence and the denominator marginalizes over all possible sequences via the forward algorithm. Together, the focal distillation loss trains the model to classify individual tokens accurately while the CRF loss ensures coherent entity sequences.
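The focal-modulated soft cross-entropy of Equation 5 and the weighted sum of Equation 4 can be sketched in NumPy as follows. Two things are assumptions rather than details from the post: p̂_t in the modulating factor is taken to be the student's probability on the teacher's most-voted label, and α = 0.5 is a placeholder since the post does not report its value.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def focal_soft_ce(logits: np.ndarray, soft_targets: np.ndarray, gamma: float = 2.0) -> float:
    """Focal-modulated soft cross-entropy summed over tokens.

    logits:       (T, L) student scores
    soft_targets: (T, L) teacher vote distributions
    """
    log_p = log_softmax(logits)
    ce = -(soft_targets * log_p).sum(axis=-1)            # per-token soft cross-entropy
    # Assumption: modulate by the student's probability on the teacher's top label.
    top = soft_targets.argmax(axis=-1)
    p_hat = np.exp(log_p[np.arange(len(top)), top])
    return float(((1.0 - p_hat) ** gamma * ce).sum())


def total_loss(distill: float, crf_nll: float, alpha: float = 0.5) -> float:
    """Weighted sum of the two terms; alpha = 0.5 is a placeholder value."""
    return alpha * distill + (1.0 - alpha) * crf_nll


# Two tokens with the same target: one confident (p = 0.95), one uncertain (p = 0.50).
logits = np.log(np.array([[0.95, 0.05], [0.50, 0.50]]))
targets = np.array([[1.0, 0.0], [1.0, 0.0]])
plain = focal_soft_ce(logits, targets, gamma=0.0)   # ordinary cross-entropy
focal = focal_soft_ce(logits, targets, gamma=2.0)   # confident token scaled by 0.0025
```

With γ = 0 the function reduces to ordinary soft cross-entropy, which makes it easy to ablate the focal term without touching the rest of the training loop.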

§ 03 · Findings

What Worked — and what didn't.

CIPHER matches LLM-level accuracy at a fraction of the cost.

On our held-out set of 195 expert-annotated clinical notes, CIPHER achieves 97.9% recall and 93.85% F1. That exceeds its teacher LLM (GPT-OSS-120B) on recall (97.9% vs 97.5%) and trails the best frontier model by only 0.2 points. The student surpassing its teacher is a known property of distillation: the ensemble's soft labels encode richer signal than any individual member provides. Critically, inference drops from seconds per document to ~50 ms, a 1000× cost reduction that makes processing these documents tractable.


Figure 4. CIPHER vs. decoder-based approaches: de-identification recall vs. cost per note for CIPHER and frontier LLM baselines on 195 expert-annotated clinical notes (hover any point for recall, F1, and cost). Frontier models (gpt-5.4-nano, gpt-5.4-mini, gpt-5.5) are shown with single-shot (k=1, filled) and 5-voter ensemble (k=5; majority vote, hollow) configurations. GPT-OSS-120B (CIPHER's teacher) is shown for reference. Cost is log-scaled, computed from per-token pricing (LLMs) or H100 compute time (CIPHER).

Soft labels are the single most impactful component.

Configuration	Recall	Precision	F1
Full system	97.9%	89.1%	93.85%
w/o CRF	96.65%	89.04%	92.69%
w/o focal loss	96.96%	90.53%	93.64%
w/o soft labels	97.43%	84.31%	90.40%

Table 2. Architecture ablation.

Removing soft labels and training on hard argmax labels instead drops F1 by 3.45 points. Precision falls by roughly 5 points (89.1% to 84.31%) as the model, lacking information about inter-class similarity and uncertainty, treats every teacher prediction with equal confidence and aggressively labels borderline tokens as PHI. The CRF contributes +1.16 F1 points through structural consistency, primarily benefiting multi-token entities like addresses and full names. Focal loss provides a modest recall boost (96.96% to 97.9%) at a small precision cost, consistent with its design intent of pushing the model toward catching more PHI. The lesson: capturing teacher uncertainty matters more than any individual architectural choice.

Scaling behavior is log-linear with diminishing returns.


Figure 5. Scaling behavior. Recall, precision, and F1 as a function of training set size. The dashed line marks the ~98% recall ceiling where returns diminish.

Recall scales from 86.7% at 500 notes to 97.7% at 15,000, with most of the gain coming early: the first 2,000 notes improve recall by ~8 points. This suggests that ~5,000 notes capture most of the achievable recall, making our choice of 15,000 conservative but thorough.

Real data beats synthetic data, decisively.

We evaluated mixtures of synthetic clinical notes (Nemotron-PII[6]) and authentic notes, holding total dataset size fixed at 10,000.

Synthetic %	Authentic %	Auth. Recall	Auth. F1	Synth. Recall	Synth. F1
100%	0%	78.2%	76.5%	96.8%	95.1%
75%	25%	85.6%	83.9%	94.1%	92.7%
50%	50%	91.3%	89.8%	94.5%	92.0%
25%	75%	96.1%	93.2%	97.2%	93.8%
0%	100%	97.4%	93.6%	95.9%	94.3%

Table 3. Performance by data mixture.

A model trained entirely on synthetic data achieves 96.8% recall on synthetic test data but only 78.2% on authentic clinical notes, a 19-point gap. Each 25% shift toward authentic data closes this gap steeply, with authentic eval recall climbing from 78.2% to 97.4%. The model trained on 100% authentic data achieves the highest recall on authentic notes (97.4%), and the 25% synthetic / 75% authentic mixture achieves the highest recall on the synthetic set (97.2%), outperforming even the model trained entirely on synthetic data (96.8%) on its own test set. The takeaway for clinical NLP broadly: synthetic data can demonstrate that a method works in principle, but the distribution mismatch between generated and authentic text translates directly into missed PHI at inference time.

De-identification must preserve clinical meaning.

A critical concern with any de-identification system is whether it preserves the clinical substance of the original text. Removing identifiers provides little value if the process also strips away the context and nuance that make the original document informative.

Metric	Value
Paired cosine similarity	0.985
KNN recall (k=10)	0.956
Wasserstein distance	0.064

Table 4. Preservation of clinical semantics.

We embed each note before and after de-identification using Clinical ModernBERT[7] and measure semantic preservation along three axes. Paired cosine similarity is 0.985, a cosine distance of about 0.015, so de-identification shifts the average note's embedding by roughly 1.5% in cosine terms. KNN recall (k=10) is 0.956, meaning 95.6% of neighbor relationships survive de-identification. Wasserstein distance is 0.064, well below the 0.1 threshold, indicating no systematic distributional distortion. These results validate a core advantage of our token-level approach over generative de-identification methods[1]: because we identify and mask specific character spans without touching surrounding text, we cannot round a lab value, compress a symptom timeline, or drop a negation. Clinical meaning is preserved by construction.
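Two of these metrics are straightforward to reproduce once the before and after embeddings are in hand. The sketch below is one plausible reading, not the post's evaluation code: in particular, KNN recall is assumed to mean the average fraction of each note's k nearest neighbors (by cosine similarity) that are still among its k nearest after de-identification. The Wasserstein computation is omitted because the post does not say how it is set up.

```python
import numpy as np


def paired_cosine(before: np.ndarray, after: np.ndarray) -> float:
    """Mean cosine similarity between each note's pre- and post-redaction embedding."""
    b = before / np.linalg.norm(before, axis=1, keepdims=True)
    a = after / np.linalg.norm(after, axis=1, keepdims=True)
    return float((b * a).sum(axis=1).mean())


def knn_indices(emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbors by cosine similarity (excluding itself)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a note is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def knn_recall(before: np.ndarray, after: np.ndarray, k: int = 10) -> float:
    """Assumed definition: mean overlap between neighbor sets before and after, divided by k."""
    nb, na = knn_indices(before, k), knn_indices(after, k)
    overlap = [len(set(r1.tolist()) & set(r2.tolist())) for r1, r2 in zip(nb, na)]
    return float(np.mean(overlap) / k)


# Stand-in embeddings: random vectors plus a small perturbation (not real note data).
rng = np.random.default_rng(0)
before = rng.normal(size=(50, 16))
after = before + 0.01 * rng.normal(size=(50, 16))
print(paired_cosine(before, after), knn_recall(before, after, k=10))
```

The dense similarity matrix is fine for a few thousand notes; a larger corpus would call for an approximate nearest-neighbor index instead.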

§ 04 · Outlook

What's next.

CIPHER demonstrates that knowledge distillation can transfer frontier-level PHI detection to a 1.5B-parameter encoder at 1000× lower inference cost, achieving 97.9% recall at ~50 ms per document.

Our ablations show that the choice of training signal matters more than architecture. Soft labels contributed the largest improvement (+3.45 F1 points), encoding boundary uncertainty and inter-class similarity that hard labels discard. The CRF adds structural consistency (+1.16 F1 points) by enforcing valid BIO transitions, and focal loss shifts gradient signal toward rare PHI tokens at a modest precision cost. The synthetic data experiments reinforce that models need to train on the distribution they will serve: the 19-point recall gap between synthetic-only and authentic-only training is a statement about distributional fidelity, not data quantity.

Our semantic preservation metrics (0.985 cosine similarity, 0.956 KNN recall) confirm that token-level de-identification maintains clinical meaning by construction. CIPHER is one step in our broader mission of delivering personalized, provider-grade care for every patient.

References

[1] Aghakasiri, Z., et al. (2025). Not what the doctor ordered: A survey on LLM-based clinical de-identification. In Proceedings of EMNLP 2025. https://arxiv.org/abs/2509.14464

[2] BehnamGhader, P., Adlakha, V., Mosbach, M., Baez, D., Muennighoff, N., & Srivastava, S. (2024). LLM2Vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961. https://arxiv.org/abs/2404.05961

[3] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. https://arxiv.org/abs/1503.02531

[4] Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1612.01474

[5] Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of ICCV 2017. https://arxiv.org/abs/1708.02002

[6] Steier, A., Manoel, A., Haushalter, A., & Van Segbroeck, M. (2025). Nemotron-PII: Synthesized data for privacy-preserving AI. NVIDIA. https://huggingface.co/datasets/nvidia/Nemotron-PII

[7] Lee, S. A., Wu, A., & Chiang, J. N. (2025). Clinical ModernBERT: An efficient and long context encoder for biomedical text. arXiv preprint arXiv:2504.03964. https://arxiv.org/abs/2504.03964

[8] Yang, A., et al. (2024). Qwen2 Technical Report. arXiv preprint arXiv:2407.10671. https://arxiv.org/abs/2407.10671

[9] Zhang, D., Li, J., Zeng, Z., & Wang, F. (2025). Jasper and Stella: distillation of SOTA embedding models. arXiv preprint arXiv:2412.19048. https://arxiv.org/abs/2412.19048

One last thing

The model you just read about treats every document the same way. Including this one.


Filed by: Liam Sullivan, Dilip Thiagarajan, Allan Bishop
Editors: The Latent Engineering Team
Published: May 7, 2026 · v1.0
Citation: Sullivan, L., Thiagarajan, D., & Bishop, A. (2026). Pushing the Pareto Frontier for Clinical Deidentification. Latent Engineering.
Source: latenthealth.com/blog