- Monitor without labels
- No training set at deployment
- Statistical guarantees
Method Overview
D3M (Disagreement-Driven Deterioration Monitoring) proceeds in three stages: Train, Calibrate, and Deploy.

1. Train — Base model learning
A composition \(\operatorname{VBLL}_\theta \circ \operatorname{FE}_\theta\) of a neural feature extractor \( \operatorname{FE}_\theta : \mathcal{X} \to \mathbb{R}^d \) and a Variational Bayesian Last Layer (VBLL) \( \operatorname{VBLL}_\theta : \mathbb{R}^d \to \mathcal{P}(\mathbb{R}^C) \) is trained on supervised ID data \( \mathcal{D}^n \sim \mathbb{P}^n \) to produce posteriors over logits for \( C \) classes:
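$$ q_\theta(z \mid x) := \mathcal{N}\big(z \mid \mu_\theta(x),\ \operatorname{diag}(\sigma_\theta^2(x))\big), \qquad z \in \mathbb{R}^C, $$
a diagonal Gaussian over the logits (the same per-input form is spelled out again under Implementation details).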
The base model defines the posterior predictive distribution (PPD):
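$$ p_\theta(y = c \mid x) = \mathbb{E}_{q_\theta(z \mid x)}\big[\operatorname{softmax}(z)_c\big] \approx \frac{1}{K} \sum_{k=1}^{K} \operatorname{softmax}\big(z^{(k)}\big)_c, \qquad z^{(k)} \sim q_\theta(z \mid x), $$
written here in its standard Monte Carlo form; this \(K\)-sample average is the "mean model" referred to in Implementation details.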
Training maximizes the ELBO with Gaussian prior \( p(z) \):
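$$ \mathcal{L}(\theta) = \sum_{(x, y) \in \mathcal{D}^n} \Big( \mathbb{E}_{q_\theta(z \mid x)}\big[\log p(y \mid z)\big] - \operatorname{KL}\big(q_\theta(z \mid x) \,\|\, p(z)\big) \Big), $$
shown here in its generic form; see the paper for the exact VBLL objective and regularization weighting.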
2. Calibrate — ID disagreement statistics
On held-out ID data, we repeatedly bootstrap \( \mathcal{D}^m_t = \{x_i\}_{i=1}^m \sim \mathbb{P}^m \), and compute maximum disagreement rates \( \phi_t \). Base model pseudo-labels are obtained as:
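$$ \hat{y}_i = \arg\max_{c \in [C]} \ \frac{1}{K} \sum_{k=1}^{K} \operatorname{softmax}\big(z_i^{(k)}\big)_c, \qquad z_i^{(k)} \sim q_\theta(z_i \mid x_i), $$
i.e., the class favored by the Monte Carlo mean model (a standard form; these logit samples are drawn separately from the ones used for disagreement below).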
We draw \( K \) posterior samples, apply temperature scaling, and sample labels:
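$$ z_i^{(k)} \sim q_\theta(z_i \mid x_i), \qquad y_i^{(k)} \sim \operatorname{Cat}\Big(\operatorname{softmax}\big(z_i^{(k)} / \tau\big)\Big), \qquad k \in [K],\ i \in [m], $$
with sampling temperature \( \tau \) (a tunable hyperparameter; see Implementation details).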
The disagreement rate for sample \( k \) is:
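$$ d_k = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\big[\, y_i^{(k)} \neq \hat{y}_i \,\big], $$
writing \( d_k \) for the fraction of the batch on which the \(k\)-th sampled hypothesis disagrees with the base model's pseudo-labels,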
and the maximum disagreement rate for round \( t \):
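$$ \phi_t = \max_{k \in [K]} d_k. $$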
The collection of all ID disagreements forms \( \Phi = \{\phi_t : t \in [T]\} \).
3. Deploy — Deterioration monitoring
At deployment, we collect unlabeled batches \( \mathcal{D}^m_{\text{te}} \sim \mathbb{Q}^m \) and compute the same statistic \( \tilde{\phi} \). The system raises an alert if:
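$$ \tilde{\phi} \;>\; \widehat{Q}_{1-\alpha}(\Phi), $$
where \( \widehat{Q}_{1-\alpha}(\Phi) \) is the empirical \((1-\alpha)\)-quantile of the ID disagreement rates (one natural form of the test; an equivalent conformal-style p-value comparison against \( \Phi \) works as well),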
ensuring an FPR bounded by \( \alpha \) under ID conditions.
Theory
For a formal theoretical analysis, including the sample complexities that guarantee low FPR and high TPR with high probability, please see Appendix A of our paper!
Why does this work?
- We need a computable, label-free quantity \(\phi\) whose value differs statistically between ID and OOD data if and only if model deterioration occurs.
- It turns out that the model disagreement rate fits exactly this description when the deterioration is caused by a covariate shift!
- Our method therefore estimates an empirical distribution \(\Phi\) of ID disagreement rates. Deterioration detection then reduces to testing whether a deployment sample's disagreement rate could have come from \(\Phi\) (see the sketch right after this list).
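A minimal sketch of this check, assuming \(\Phi\) is stored as an array of calibration-round disagreement rates (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def deterioration_alert(phi_id, phi_test, alpha=0.05):
    """Return (alert, p_value): flag deterioration if the deployment-batch
    disagreement rate phi_test is implausibly large relative to the ID rates."""
    phi_id = np.asarray(phi_id)
    # Conservative empirical p-value: fraction of calibration rounds whose
    # disagreement rate is at least as large as the deployment one.
    p_value = (1 + np.sum(phi_id >= phi_test)) / (len(phi_id) + 1)
    return p_value <= alpha, p_value
```

The \(+1\) terms give a conservative, conformal-style p-value, in line with the FPR control described in the method overview.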
What is model disagreement?
Assume that, from a hypothesis family \(\mathcal{H}\), an ERM algorithm selects a hypothesis \(f\) using the ID training samples \(\mathcal{D}^n\). Among the competing hypotheses \(h \in \mathcal{H}\) that also explain \(\mathcal{D}^n\), the ID model disagreement rate is the rate at which the maximally disagreeing \(h\) disagrees with \(f\) on a held-out ID validation batch.
Searching for this maximally disagreeing hypothesis directly is hard. We turn the optimization problem into a sampling problem: we draw candidate hypotheses from a Bayesian model and keep the one that disagrees most with \(f\).
Repeating this over random bootstraps of a held-out ID validation set yields an empirical distribution of approximate disagreement rates, giving a robust statistical characterization of the ID disagreement rate.
Implementation details
The composition \(\operatorname{VBLL}_\theta \circ \operatorname{FE}_\theta\) is trained end-to-end on ID data of size \( n \) by maximizing the ELBO. For calibration, across rounds \( t \in [T] \), we sample datasets \( \mathcal{D}^m_t \) of size \( m \ll n \) (with replacement) from a large held-out ID validation set. Each \( \mathcal{D}^m_t \) is processed in a single forward pass, producing \( m \) independent \( C \)-dimensional Gaussian posteriors:
$$ q_\theta(z_i \mid x_i) := \mathcal{N}\big(z_i \mid \mu_\theta(x_i), \operatorname{diag}(\sigma_\theta^2(x_i))\big) $$
For each posterior, we draw \( 2K \) samples: \( K \) for Monte Carlo estimation of the mean model, and \( K \) for the maximum-disagreement computation. We keep the sampling temperature \( \tau \), the dataset size \( m \), and the number of posterior samples \( K \) as tunable hyperparameters. Crucially, these must be identical between the computation of \( \Phi \) and the subsequent deployment-time monitoring test, where the exact same procedure is applied to a deployment sample \( \mathcal{D}^m_{\text{te}} \).
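As a concrete reference, here is a minimal PyTorch sketch of one calibration (or deployment) round, assuming the VBLL head exposes the per-sample posterior means and standard deviations as \((m, C)\) tensors; the function name, tensor layout, and defaults are illustrative rather than the paper's reference implementation:

```python
import torch

def max_disagreement_rate(mu, sigma, K=50, tau=1.0, generator=None):
    """One round of the D3M-style statistic.

    mu, sigma: (m, C) means and standard deviations of the per-sample Gaussian
               logit posteriors produced in a single forward pass.
    Returns the maximum disagreement rate phi for this batch.
    """
    m, C = mu.shape
    # 2K Gaussian logit samples per input: K for the mean-model pseudo-labels,
    # K for the maximally disagreeing hypothesis.
    eps = torch.randn(2 * K, m, C, generator=generator)
    logits = mu.unsqueeze(0) + sigma.unsqueeze(0) * eps

    # Pseudo-labels from the Monte Carlo estimate of the posterior predictive mean.
    probs_mean = torch.softmax(logits[:K], dim=-1).mean(dim=0)          # (m, C)
    pseudo_labels = probs_mean.argmax(dim=-1)                           # (m,)

    # Temperature-scaled label samples from the remaining K hypotheses.
    probs_k = torch.softmax(logits[K:] / tau, dim=-1)                   # (K, m, C)
    sampled = torch.distributions.Categorical(probs=probs_k).sample()   # (K, m)

    # Disagreement rate of each sampled hypothesis against the pseudo-labels,
    # then keep the maximally disagreeing one.
    disagreement = (sampled != pseudo_labels.unsqueeze(0)).float().mean(dim=1)
    return disagreement.max().item()
```

Calibration repeats this call over \( T \) bootstrapped validation batches to populate \( \Phi \); at deployment, the identical call (same \( m \), \( K \), and \( \tau \)) is applied once to \( \mathcal{D}^m_{\text{te}} \) and the result is compared against \( \Phi \).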
Monitoring large models
The flexibility of D3M lies in the broad choice of \(\operatorname{FE}_\theta\). Vision-language model features can serve as \(\operatorname{FE}_\theta\) when monitoring vision classification tasks, while language model backbones such as BERT can play the same role for text classification tasks.
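For instance, a minimal sketch of wiring a frozen BERT backbone in as \(\operatorname{FE}_\theta\) (assuming bert-base-uncased with [CLS] pooling; whether the backbone stays frozen or is fine-tuned end-to-end with the head is a deployment choice):

```python
import torch
from transformers import AutoModel, AutoTokenizer

class BertFeatureExtractor(torch.nn.Module):
    """A text-classification FE_theta: BERT encoder with [CLS] pooling,
    frozen here purely for illustration."""

    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.backbone = AutoModel.from_pretrained(name)
        for p in self.backbone.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def forward(self, texts):
        batch = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        # (batch, d) features from the [CLS] token, consumed by the VBLL head downstream.
        return self.backbone(**batch).last_hidden_state[:, 0]
```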
As stated, however, this does not cover classifiers based on in-context prompting, since D3M relies on training \(\operatorname{VBLL}_\theta\) on top of the extracted features; extending the method to that setting is work in progress.
Results
We summarize performance on both classical benchmarks and real-world GEMINI clinical data. Full quantitative tables and visualizations are available in the paper.
Classical Benchmarks
D3M demonstrates strong TPR across a diverse set of datasets — UCI Heart Disease, CIFAR-10/10.1, and Camelyon-17 — while maintaining controlled FPR under ID calibration.
| Method | UCI Heart Disease (10) | UCI Heart Disease (20) | UCI Heart Disease (50) | CIFAR-10.1 (10) | CIFAR-10.1 (20) | CIFAR-10.1 (50) | Camelyon-17 (10) | Camelyon-17 (20) | Camelyon-17 (50) |
|---|---|---|---|---|---|---|---|---|---|
| BBSD | .13±.03 | .22±.04 | .46±.05 | .07±.03 | .05±.02 | .12±.03 | .16±.04 | .38±.05 | .87±.03 |
| Rel. Mahalanobis | .11±.03 | .36±.05 | .66±.05 | .05±.02 | .03±.03 | .04±.02 | .16±.04 | .40±.05 | .89±.03 |
| Deep Ensemble | .13±.03 | .32±.05 | .64±.05 | .33±.05 | .52±.05 | .68±.05 | .14±.03 | .26±.04 | .82±.04 |
| CTST | .15±.04 | .51±.05 | **.98±.01** | .03±.02 | .04±.02 | .04±.02 | .11±.03 | .59±.05 | .59±.05 |
| MMD-D | .09±.03 | .12±.03 | .27±.04 | .24±.04 | .10±.03 | .05±.02 | .42±.05 | .62±.05 | .69±.05 |
| H-Div | .15±.04 | .26±.04 | .37±.05 | .02±.01 | .05±.02 | .04±.02 | .03±.02 | .07±.03 | .23±.04 |
| Detectron | .24±.04 | **.57±.05** | .82±.04 | .37±.05 | **.54±.05** | **.83±.04** | **.97±.02** | **1.0±.00** | .96±.02 |
| D3M (Ours) | **.38±.19** | .25±.28 | .69±.33 | **.40±.10** | .45±.10 | .74±.12 | .89±.20 | .93±.05 | **.99±.02** |
Table 1. True positive rate (TPR) comparison across datasets and query sizes (shown in parentheses). Because the models experience deterioration under these shifts, higher TPR indicates better detection performance. Bold values denote the best in each column; results are averaged over 10 independent seeds. D3M results are averaged across runs achieving ID FPR below \(\alpha\).
Results on GEMINI
We illustrate D3M's deployment on GEMINI, an EHR dataset with engineered tabular features; the task is 14-day mortality prediction.
Non-deteriorating temporal shift:


(Left) Performance of the base model over half-years since 2017. Little drop in performance is observed; the natural temporal shift is thus non-deteriorating. (Right) Indeed, D3M does not flag these longitudinal distribution shifts. We know the shifts exist because distance-based baselines (JS-Div, KL-Div, Relative Mahalanobis) flag them heavily.
Manufactured deteriorating shift:


(Left) The base model is trained on patients between 18 and 52 years of age, then deployed on test batches mixed, at varying ratios, with patients older than 85 to induce an artificial distribution shift. Model deterioration is indeed observed. (Right) D3M's TPR increases as more OOD data is mixed into the deployment test sample.
BibTeX
If you find this work useful in your research, please cite:
@inproceedings{nguyen2025disagreement,
  title     = {Reliably detecting model failures in deployment without labels},
  author    = {Nguyen, Viet and Shui, Changjian and Giri, Vijay and Arya, Siddarth and Verma, Amol and Razak, Fahad and Krishnan, Rahul G},
  booktitle = {Proceedings of the Conference on Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}