
Testing a Candidate SUV Spatial Metric as a Biomarker

Ref:

  • ChatGPT 5.4

Suppose each patient/image has a binary outcome

$$ Y_i = \begin{cases} 1, & \text{adverse event (AE)}, \\ 0, & \text{non-AE / control}, \end{cases} $$

and a candidate imaging metric

$$ z_i \in \mathbb{R}. $$

Examples of candidate metrics are:

$$ z_i = SUV95_i, $$

$$ z_i = \operatorname{TailExcess}_{95,i}, $$

$$ z_i = \operatorname{ComponentEntropy}_{95,i}, $$

$$ z_i = \operatorname{LocalContrast}_{95,i}. $$

The goal is to test whether $z_i$ contains useful information for distinguishing AE from non-AE cases.


1. Descriptive comparison between AE and non-AE groups

First compare the empirical distributions:

$$ \{z_i : Y_i = 1\} \qquad \text{and} \qquad \{z_i : Y_i = 0\}. $$

Useful summaries are:

$$ \operatorname{median}(z \mid Y=1), \qquad \operatorname{median}(z \mid Y=0), $$

$$ \operatorname{IQR}(z \mid Y=1), \qquad \operatorname{IQR}(z \mid Y=0). $$

A simple effect-size summary is the difference in medians:

$$ \Delta_{\mathrm{med}} = \operatorname{median}(z \mid Y=1) - \operatorname{median}(z \mid Y=0). $$

Because AE sample size is often small, visualization is essential. Use a strip plot, box plot, or violin plot, with individual patients shown explicitly.
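
As a minimal sketch of these summaries, the snippet below computes the group medians, IQRs, and $\Delta_{\mathrm{med}}$ using only the Python standard library. The values in `z_ae` and `z_nc` are hypothetical toy numbers, not patient data, and the helper names `iqr` and `describe` are illustrative assumptions.

```python
from statistics import median, quantiles

# Hypothetical toy values of one candidate metric (not real patient data).
z_ae = [4.1, 5.3, 6.0, 7.2, 5.8]                  # Y = 1 (AE)
z_nc = [3.0, 3.5, 4.2, 2.8, 3.9, 4.4, 3.1, 5.0]   # Y = 0 (non-AE)


def iqr(values):
    """Interquartile range Q3 - Q1, interpolating linearly between order statistics."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def describe(z_pos, z_neg):
    """Return medians, IQRs, and the median difference for the two groups."""
    med_pos, med_neg = median(z_pos), median(z_neg)
    return {
        "median_ae": med_pos,
        "median_nc": med_neg,
        "iqr_ae": iqr(z_pos),
        "iqr_nc": iqr(z_neg),
        "delta_med": med_pos - med_neg,
    }


summary = describe(z_ae, z_nc)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Different quantile conventions give slightly different IQRs in small samples, so it is worth stating which one was used.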


2. Nonparametric group-difference test

A first screening test is the Mann--Whitney U test.

The null hypothesis is:

$$ H_0: z \mid Y=1 \quad \text{and} \quad z \mid Y=0 \quad \text{come from the same distribution}. $$

The alternative hypothesis is:

$$ H_1: z \mid Y=1 \quad \text{and} \quad z \mid Y=0 \quad \text{differ in distribution}. $$

This test is useful for screening, but with small datasets the $p$-value should not be overinterpreted.

Important:

  • A small $p$-value does not prove clinical usefulness.
  • A large $p$-value does not prove the metric is useless.
  • With very small AE count, the test has low statistical power.
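
A sketch of this screening test, assuming SciPy is available and using hypothetical toy values rather than patient data, could look like the following. `scipy.stats.mannwhitneyu` returns the U statistic for the first sample; dividing it by $n_{\mathrm{AE}} \cdot n_{\mathrm{NC}}$ gives the fraction of AE/non-AE pairs in which the AE value is larger.

```python
from scipy.stats import mannwhitneyu

# Hypothetical toy values of one candidate metric (not real patient data).
z_ae = [4.1, 5.3, 6.0, 7.2, 5.8]                  # Y = 1 (AE)
z_nc = [3.0, 3.5, 4.2, 2.8, 3.9, 4.4, 3.1, 5.0]   # Y = 0 (non-AE)

# Two-sided test of H0: both groups come from the same distribution.
result = mannwhitneyu(z_ae, z_nc, alternative="two-sided")
u_stat = result.statistic
p_value = result.pvalue

# U for the AE sample counts the (AE, non-AE) pairs with z_AE > z_NC
# (ties count one half), so U / (n_AE * n_NC) lies between 0 and 1.
pair_fraction = u_stat / (len(z_ae) * len(z_nc))

print(f"U = {u_stat:.1f}, p = {p_value:.4f}, pair fraction = {pair_fraction:.3f}")
```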

3. ROC AUC as a discrimination measure

For biomarker evaluation, ROC AUC is often more relevant than only a group-difference test.

Given a threshold $c$, classify a case as AE if

$$ z_i \ge c. $$

Then define:

$$ \operatorname{TPR}(c) = P(z \ge c \mid Y=1), $$

$$ \operatorname{FPR}(c) = P(z \ge c \mid Y=0). $$

The ROC curve is

$$ \operatorname{TPR}(c) \quad \text{versus} \quad \operatorname{FPR}(c), $$

as the threshold $c$ varies.

The AUC is:

$$ \operatorname{AUC} = P(z_{\mathrm{AE}} > z_{\mathrm{NC}}), $$

where $z_{\mathrm{AE}}$ is a randomly selected AE value and $z_{\mathrm{NC}}$ is a randomly selected non-AE value.

Interpretation:

$$ \operatorname{AUC} = 0.5 \quad \Rightarrow \quad \text{no discrimination}, $$

$$ \operatorname{AUC} > 0.5 \quad \Rightarrow \quad \text{larger metric values tend to indicate AE}, $$

$$ \operatorname{AUC} < 0.5 \quad \Rightarrow \quad \text{smaller metric values tend to indicate AE}. $$

For screening, it is useful to define an oriented AUC:

$$ \operatorname{AUC}_{\mathrm{oriented}} = \max(\operatorname{AUC}, 1-\operatorname{AUC}). $$

This measures discrimination strength independent of direction.
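
The pairwise definition of AUC can be computed directly, as in the sketch below. It uses hypothetical toy values and counts ties as one half, which is a common convention assumed here rather than something fixed by the definition above. The function names `auc_pairwise` and `oriented_auc` are illustrative.

```python
def auc_pairwise(z_pos, z_neg):
    """Empirical AUC: fraction of (AE, non-AE) pairs with z_AE > z_NC.

    Ties are counted as one half (an assumed convention).
    """
    wins = 0.0
    for a in z_pos:
        for b in z_neg:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(z_pos) * len(z_neg))


def oriented_auc(auc):
    """Discrimination strength independent of direction."""
    return max(auc, 1.0 - auc)


# Hypothetical toy values of one candidate metric (not real patient data).
z_ae = [4.1, 5.3, 6.0, 7.2, 5.8]                  # Y = 1 (AE)
z_nc = [3.0, 3.5, 4.2, 2.8, 3.9, 4.4, 3.1, 5.0]   # Y = 0 (non-AE)

auc = auc_pairwise(z_ae, z_nc)
auc_flipped = auc_pairwise(z_nc, z_ae)  # same data with the groups swapped

print(auc, oriented_auc(auc), oriented_auc(auc_flipped))
```

Swapping the two groups maps AUC to 1 - AUC, so the oriented AUC is the same in both cases.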


4. Bootstrap confidence interval for AUC

Because the AE group is usually small, the AUC estimate can be unstable.

Use bootstrap resampling:

  1. sample patients with replacement;
  2. compute AUC in each bootstrap sample;
  3. repeat many times;
  4. take empirical quantiles of the bootstrap AUC values.

Let

$$ \operatorname{AUC}^{*(b)} $$

be the bootstrap AUC from bootstrap sample $b$, where

$$ b = 1,\dots,B. $$

A simple percentile confidence interval is:

$$ \left[ Q_{0.025}\left(\operatorname{AUC}^{*}\right), Q_{0.975}\left(\operatorname{AUC}^{*}\right) \right]. $$

If a bootstrap sample contains only one class, it should be skipped because AUC is not defined.

A candidate metric is more promising if its $\operatorname{AUC}$ is high and the bootstrap confidence interval is not extremely wide.
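
The bootstrap procedure can be sketched as follows, using only the standard library. The data are hypothetical toy values, the function names are illustrative, and the quantiles are taken as simple order statistics without interpolation, which is one of several reasonable choices.

```python
import random


def auc_pairwise(z_pos, z_neg):
    """Fraction of (AE, non-AE) pairs with z_AE > z_NC, ties counted as one half."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in z_pos for b in z_neg)
    return wins / (len(z_pos) * len(z_neg))


def bootstrap_auc_ci(z, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(z)
    aucs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        pos = [z[i] for i in idx if y[i] == 1]
        neg = [z[i] for i in idx if y[i] == 0]
        if not pos or not neg:
            continue  # AUC is undefined when only one class is present
        aucs.append(auc_pairwise(pos, neg))
    aucs.sort()
    lo = aucs[int((alpha / 2) * (len(aucs) - 1))]
    hi = aucs[int((1 - alpha / 2) * (len(aucs) - 1))]
    return lo, hi, len(aucs)


# Hypothetical toy data (not real patient data): 5 AE cases, 8 non-AE cases.
z = [4.1, 5.3, 6.0, 7.2, 5.8, 3.0, 3.5, 4.2, 2.8, 3.9, 4.4, 3.1, 5.0]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

ci_low, ci_high, n_used = bootstrap_auc_ci(z, y)
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}] from {n_used} usable bootstrap samples")
```

Reporting `n_used` alongside the interval shows how many bootstrap samples had to be skipped.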


5. Logistic regression model for one metric

A simple probabilistic model is:

$$ P(Y_i = 1 \mid z_i) = \operatorname{sigmoid}(\beta_0 + \beta_1 z_i), $$

where

$$ \operatorname{sigmoid}(u) = \frac{1}{1+\exp(-u)}. $$

Equivalently,

$$ \operatorname{logit} P(Y_i = 1 \mid z_i) = \beta_0 + \beta_1 z_i. $$

Before fitting, it is usually useful to standardize the metric:

$$ \tilde{z}_i = \frac{z_i - \bar{z}}{s_z}. $$

Then the model becomes:

$$ P(Y_i = 1 \mid \tilde{z}_i) = \operatorname{sigmoid}(\beta_0 + \beta_1 \tilde{z}_i). $$

The sign of $\beta_1$ gives the direction of association:

$$ \beta_1 > 0 \quad \Rightarrow \quad \text{larger metric values are associated with higher AE risk}, $$

$$ \beta_1 < 0 \quad \Rightarrow \quad \text{larger metric values are associated with lower AE risk}. $$

With small AE count, standard logistic regression may be unstable. Penalized logistic regression is often safer.
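
One way to fit a penalized single-metric model is sketched below, assuming scikit-learn and NumPy are available. The data are hypothetical toy values. `LogisticRegression` applies an L2 penalty by default, with `C` as the inverse penalty strength. The choice `C=1.0` is an arbitrary assumption, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not real patient data): 5 AE cases, 8 non-AE cases.
z = np.array([4.1, 5.3, 6.0, 7.2, 5.8,
              3.0, 3.5, 4.2, 2.8, 3.9, 4.4, 3.1, 5.0]).reshape(-1, 1)
y = np.array([1] * 5 + [0] * 8)

# StandardScaler centres and scales z (it divides by the population SD, ddof=0).
# LogisticRegression is L2-penalized by default; smaller C means a stronger penalty.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
model.fit(z, y)

logit_step = model.named_steps["logisticregression"]
beta_0 = logit_step.intercept_[0]
beta_1 = logit_step.coef_[0, 0]
p_ae = model.predict_proba(z)[:, 1]  # fitted P(Y = 1 | z) for each case

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}")
```

Because the penalty shrinks $\beta_1$ toward zero, its magnitude depends on `C`; the sign is usually the more robust quantity to report.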


6. Compare a new metric against baseline SUV95

A new metric is interesting only if it adds information beyond a baseline such as $SUV95$.

Let

$$ z_i^{(0)} = SUV95_i $$

be the baseline metric, and let

$$ z_i^{(1)} $$

be a new candidate metric, for example local contrast or component entropy.

First check correlation:

$$ \rho_S = \operatorname{corr}_{\mathrm{Spearman}} \left( z^{(0)}, z^{(1)} \right). $$

If

$$ |\rho_S| \approx 1, $$

then the new metric may mostly duplicate SUV95.

Then compare logistic models:

Baseline model:

$$ \operatorname{logit} P(Y_i=1) = \beta_0 + \beta_1 z_i^{(0)}. $$

Extended model:

$$ \operatorname{logit} P(Y_i=1) = \beta_0 + \beta_1 z_i^{(0)} + \beta_2 z_i^{(1)}. $$

The new metric is promising if the extended model improves prediction and $\beta_2$ is stable under resampling.

However, with very small AE count, two-variable models can be unreliable. Therefore, this should be treated as exploratory.
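
An exploratory sketch of both checks, assuming SciPy and scikit-learn are available: the Spearman correlation between the baseline and the new metric, and the sign stability of $\beta_2$ under bootstrap resampling. Both metric vectors are hypothetical toy values. The helper `fit_beta2`, the 500 resamples, and the penalty setting `C=1.0` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not real patient data): 5 AE cases, 8 non-AE cases.
y = np.array([1] * 5 + [0] * 8)
z0 = np.array([4.1, 5.3, 6.0, 7.2, 5.8, 3.0, 3.5, 4.2, 2.8, 3.9, 4.4, 3.1, 5.0])
z1 = np.array([0.62, 0.48, 0.71, 0.66, 0.59,
               0.41, 0.55, 0.38, 0.47, 0.52, 0.44, 0.36, 0.50])

# |rho_S| close to 1 would suggest the new metric mostly duplicates the baseline.
rho_s, _ = spearmanr(z0, z1)


def fit_beta2(z0, z1, y):
    """Fit the extended L2-penalized model and return the coefficient of z1."""
    X = np.column_stack([z0, z1])
    model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
    model.fit(X, y)
    return model.named_steps["logisticregression"].coef_[0, 1]


beta2_hat = fit_beta2(z0, z1, y)

# Bootstrap over patients; skip resamples that contain only one class.
rng = np.random.default_rng(0)
beta2_boot = []
for _ in range(500):
    idx = rng.integers(0, len(y), size=len(y))
    if y[idx].min() == y[idx].max():
        continue
    beta2_boot.append(fit_beta2(z0[idx], z1[idx], y[idx]))
beta2_boot = np.array(beta2_boot)

# Fraction of resamples in which beta_2 keeps the sign of the full-data estimate.
same_sign = float(np.mean(np.sign(beta2_boot) == np.sign(beta2_hat)))

print(f"rho_S = {rho_s:.3f}, beta_2 = {beta2_hat:.3f}, sign stability = {same_sign:.2f}")
```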


7. Cross-validated AUC

To estimate out-of-sample discrimination, use cross-validation.

For each fold:

  1. fit the model on training data;
  2. predict AE probabilities on held-out data;
  3. compute AUC on held-out predictions.

The final estimate is:

$$ \operatorname{AUC}_{CV} = \frac{1}{K} \sum_{k=1}^K \operatorname{AUC}_k. $$

With stratified cross-validation, every held-out fold needs at least one AE case for its AUC to be defined, so the number of folds must not exceed the number of AE cases:

$$ K \le n_{\mathrm{AE}}. $$

For example, if

$$ n_{\mathrm{AE}} = 5, $$

then at most 5-fold stratified cross-validation is possible.
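
A sketch of stratified cross-validated AUC, assuming scikit-learn and NumPy are available and using hypothetical toy data. The number of folds is capped by the smaller class count so that every held-out fold contains both classes. The penalty setting and the random seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not real patient data): 5 AE cases, 8 non-AE cases.
z = np.array([4.1, 5.3, 6.0, 7.2, 5.8,
              3.0, 3.5, 4.2, 2.8, 3.9, 4.4, 3.1, 5.0]).reshape(-1, 1)
y = np.array([1] * 5 + [0] * 8)

n_ae = int(y.sum())
n_nc = int(len(y) - n_ae)
k = min(5, n_ae, n_nc)  # K must not exceed the size of the smaller class

cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)

# The scaler is inside the pipeline, so it is refit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))

fold_aucs = cross_val_score(model, z, y, cv=cv, scoring="roc_auc")
auc_cv = fold_aucs.mean()

print("per-fold AUC:", np.round(fold_aucs, 3), "mean:", round(float(auc_cv), 3))
```

With one AE case per held-out fold, each per-fold AUC can take only a few distinct values, so the mean is a coarse estimate.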


8. Multiple testing caution

If many metrics are tested,

$$ z^{(1)}, z^{(2)}, \dots, z^{(m)}, $$

then some may appear promising by chance.

For exploratory analysis, report results transparently and avoid strong claims.

For confirmatory analysis, use correction methods such as Bonferroni:

$$ \alpha_{\mathrm{Bonferroni}} = \frac{\alpha}{m}, $$

or false discovery rate control.

In small datasets, it is better to test a small number of biologically motivated metrics than a large radiomics feature set.
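
Both corrections are short enough to sketch with the standard library. The false discovery rate procedure shown is Benjamini-Hochberg, chosen here as one common option. The p-values are hypothetical and the function names are illustrative.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for metric j when p_j <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject


# Hypothetical Mann-Whitney p-values for m = 5 candidate metrics.
p_values = [0.012, 0.001, 0.2, 0.045, 0.025]

bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("Bonferroni:", bonf)
print("Benjamini-Hochberg:", bh)
```

In this toy example Bonferroni keeps one metric and Benjamini-Hochberg keeps three, which illustrates how much more conservative the former is.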


9. Recommended reporting table

For each candidate metric, report:

$$ \operatorname{median}(z \mid AE), \qquad \operatorname{median}(z \mid NC), $$

$$ p_{\mathrm{MWU}}, $$

$$ \operatorname{AUC}, $$

$$ 95\% \ \operatorname{CI}_{AUC}. $$

A useful table structure is:

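| Metric | AE median | NC median | MWU p-value | AUC | AUC 95% CI | Direction |
|---|---|---|---|---|---|---|
| SUV95 | | | | | | |
| TailExcess95 | | | | | | |
| ComponentEntropy95 | | | | | | |
| LargestComponentFraction95 | | | | | | |
| TailLocalContrast95 | | | | | | |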

10. Practical interpretation

A useful biomarker candidate should satisfy several criteria:

  1. It visually separates AE and non-AE cases.
  2. It has reasonable AUC.
  3. Its bootstrap AUC confidence interval is not extremely wide.
  4. It is robust across nearby thresholds, for example $Q_{90}$, $Q_{95}$, and $Q_{97.5}$.
  5. It is not merely a duplicate of $SUV95$.
  6. It has a plausible biological interpretation.
  7. It remains stable when influential patients are removed.
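
Criterion 7 can be probed with a leave-one-patient-out check, sketched below with the standard library. The data are hypothetical toy values, the function names are illustrative, and using the range of leave-one-out AUCs as a stability summary is an assumption rather than an established rule.

```python
def auc_pairwise(z_pos, z_neg):
    """Fraction of (AE, non-AE) pairs with z_AE > z_NC, ties counted as one half."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in z_pos for b in z_neg)
    return wins / (len(z_pos) * len(z_neg))


def leave_one_out_aucs(z, y):
    """Recompute AUC with each patient removed in turn."""
    aucs = []
    for drop in range(len(z)):
        pos = [v for i, v in enumerate(z) if i != drop and y[i] == 1]
        neg = [v for i, v in enumerate(z) if i != drop and y[i] == 0]
        if pos and neg:  # AUC needs both classes
            aucs.append(auc_pairwise(pos, neg))
    return aucs


# Hypothetical toy data (not real patient data): 5 AE cases, 8 non-AE cases.
z = [4.1, 5.3, 6.0, 7.2, 5.8, 3.0, 3.5, 4.2, 2.8, 3.9, 4.4, 3.1, 5.0]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

loo = leave_one_out_aucs(z, y)
print(f"leave-one-out AUC range: [{min(loo):.3f}, {max(loo):.3f}]")
```

A wide range indicates that the apparent discrimination depends heavily on one or two patients.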

For spatial SUV metrics, especially promising candidates are:

$$ SUV95, $$

$$ \operatorname{TailExcess}_{95}, $$

$$ \operatorname{LargestComponentFraction}_{95}, $$

$$ \operatorname{ComponentEntropy}_{95}, $$

$$ \operatorname{TailSpread}_{95}, $$

$$ \operatorname{LocalContrast}_{95}. $$

The best metric is not necessarily the one with the smallest $p$-value. It should be interpretable, robust, and clinically plausible.