# Testing a Candidate SUV Spatial Metric as a Biomarker

Ref:

* ChatGPT 5.4


Suppose each patient/image has a binary outcome

$$
Y_i =
\begin{cases}
1, & \text{adverse event (AE)},\\
0, & \text{non-AE / control},
\end{cases}
$$

and a candidate imaging metric

$$
z_i \in \mathbb{R}.
$$

Examples of candidate metrics are:

$$
z_i = SUV95_i,
$$

$$
z_i = \operatorname{TailExcess}_{95,i},
$$

$$
z_i = \operatorname{ComponentEntropy}_{95,i},
$$

$$
z_i = \operatorname{LocalContrast}_{95,i}.
$$

The goal is to test whether $z_i$ contains useful information for distinguishing AE from non-AE cases.

---

## 1. Descriptive comparison between AE and non-AE groups

First compare the empirical distributions:

$$
\{z_i : Y_i = 1\}
\qquad \text{and} \qquad
\{z_i : Y_i = 0\}.
$$

Useful summaries are:

$$
\operatorname{median}(z \mid Y=1),
\qquad
\operatorname{median}(z \mid Y=0),
$$

$$
\operatorname{IQR}(z \mid Y=1),
\qquad
\operatorname{IQR}(z \mid Y=0).
$$

A simple effect-size summary is the difference in medians:

$$
\Delta_{\mathrm{med}}
=
\operatorname{median}(z \mid Y=1)
-
\operatorname{median}(z \mid Y=0).
$$

Because the AE group is often small, visualization is essential. Use a strip plot, box plot, or violin plot, and show each patient as an individual point.
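The summaries above can be computed in a few lines. The following is a minimal sketch assuming NumPy; the function name `describe_groups` and the toy values are hypothetical and only illustrate the calculation.

```python
import numpy as np

def describe_groups(z, y):
    """Median and IQR of z in each outcome group, plus the difference in medians."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=int)
    out = {}
    for label, name in ((1, "AE"), (0, "NC")):
        zg = z[y == label]
        q25, q50, q75 = np.percentile(zg, [25, 50, 75])
        out[name] = {"n": int(zg.size), "median": float(q50), "iqr": float(q75 - q25)}
    # Delta_med = median(z | Y=1) - median(z | Y=0)
    out["delta_med"] = out["AE"]["median"] - out["NC"]["median"]
    return out

# Toy values, invented purely for illustration (not real patient data).
z = [2.1, 2.4, 2.2, 2.8, 3.0, 2.5, 3.6, 3.9, 4.2]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
summary = describe_groups(z, y)
print(summary)
```

Reporting the group sizes next to the medians makes it obvious when a summary rests on only a handful of AE cases.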

---

## 2. Nonparametric group-difference test

A first screening test is the Mann-Whitney U test.

The null hypothesis is:

$$
H_0:
z \mid Y=1
\quad \text{and} \quad
z \mid Y=0
\quad
\text{come from the same distribution}.
$$

The alternative hypothesis is:

$$
H_1:
z \mid Y=1
\quad \text{and} \quad
z \mid Y=0
\quad
\text{differ in distribution}.
$$

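This test is useful for screening, but with small datasets the $p$-value should not be overinterpreted. In practice the test is most sensitive to one group tending to have larger values than the other, which is the same quantity the AUC in Section 3 measures.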

Important:

- A small $p$-value does not prove clinical usefulness.
- A large $p$-value does not prove the metric is useless.
- With a very small AE count, the test has low statistical power.
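A minimal sketch of the screening test, assuming SciPy is available. The wrapper `mwu_screen` and the toy values are hypothetical; `scipy.stats.mannwhitneyu` is the only library call used.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def mwu_screen(z, y):
    """Two-sided Mann-Whitney U test of z between AE (y=1) and non-AE (y=0)."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=int)
    z_ae, z_nc = z[y == 1], z[y == 0]
    res = mannwhitneyu(z_ae, z_nc, alternative="two-sided")
    return float(res.statistic), float(res.pvalue)

# Toy values, invented purely for illustration (not real patient data).
z = [2.1, 2.4, 2.2, 2.8, 3.0, 2.5, 3.6, 3.9, 4.2]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
u_stat, p_value = mwu_screen(z, y)
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```

With only three AE cases, as in this toy example, even perfect separation cannot produce a very small $p$-value, which illustrates the low-power caveat above.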

---

## 3. ROC AUC as a discrimination measure

For biomarker evaluation, ROC AUC is often more informative than a group-difference test alone, because it directly measures how well the metric separates the two groups.

Given a threshold $c$, classify a case as AE if

$$
z_i \ge c.
$$

Then define:

$$
\operatorname{TPR}(c)
=
P(z \ge c \mid Y=1),
$$

$$
\operatorname{FPR}(c)
=
P(z \ge c \mid Y=0).
$$

The ROC curve is

$$
\operatorname{TPR}(c)
\quad \text{versus} \quad
\operatorname{FPR}(c),
$$

as the threshold $c$ varies.

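The AUC is:

$$
\operatorname{AUC}
=
P(z_{\mathrm{AE}} > z_{\mathrm{NC}})
+
\tfrac{1}{2} P(z_{\mathrm{AE}} = z_{\mathrm{NC}}),
$$

where $z_{\mathrm{AE}}$ is a randomly selected AE value and $z_{\mathrm{NC}}$ is a randomly selected non-AE value. The second term accounts for ties and is zero for a continuous metric.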

Interpretation:

$$
\operatorname{AUC} = 0.5
\quad \Rightarrow \quad
\text{no discrimination},
$$

$$
\operatorname{AUC} > 0.5
\quad \Rightarrow \quad
\text{larger metric values tend to indicate AE},
$$

$$
\operatorname{AUC} < 0.5
\quad \Rightarrow \quad
\text{smaller metric values tend to indicate AE}.
$$

For screening, it is useful to define an oriented AUC:

$$
\operatorname{AUC}_{\mathrm{oriented}}
=
\max(\operatorname{AUC}, 1-\operatorname{AUC}).
$$

This measures discrimination strength independent of direction.
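The probabilistic definition of AUC can be evaluated directly by comparing every AE value with every non-AE value. The sketch below assumes NumPy; `empirical_auc` and `oriented_auc` are hypothetical helper names, and ties are counted as one half. The same number is obtained by dividing the Mann-Whitney U statistic of the AE group by $n_{\mathrm{AE}} \, n_{\mathrm{NC}}$.

```python
import numpy as np

def empirical_auc(z, y):
    """Fraction of (AE, non-AE) pairs with z_AE > z_NC, counting ties as 1/2."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=int)
    z_ae, z_nc = z[y == 1], z[y == 0]
    if z_ae.size == 0 or z_nc.size == 0:
        raise ValueError("AUC needs at least one case in each class")
    diff = z_ae[:, None] - z_nc[None, :]  # all pairwise differences
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def oriented_auc(z, y):
    """max(AUC, 1 - AUC): discrimination strength regardless of direction."""
    auc = empirical_auc(z, y)
    return max(auc, 1.0 - auc)

# Toy values, invented purely for illustration (not real patient data).
z = [1.0, 2.0, 3.0, 2.0, 4.0]
y = [0, 0, 0, 1, 1]
print(empirical_auc(z, y))                 # 4.5 favourable pairs out of 6 -> 0.75
print(oriented_auc([-v for v in z], y))    # direction flipped, strength unchanged -> 0.75
```

The pairwise form costs $O(n_{\mathrm{AE}} \, n_{\mathrm{NC}})$ operations, which is negligible at the sample sizes discussed here.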

---

## 4. Bootstrap confidence interval for AUC

Because the AE group is usually small, the AUC estimate can be unstable.

Use bootstrap resampling:

1. sample patients with replacement;
2. compute AUC in each bootstrap sample;
3. repeat many times;
4. take empirical quantiles of the bootstrap AUC values.

Let

$$
\operatorname{AUC}^{*(b)}
$$

be the bootstrap AUC from bootstrap sample $b$, where

$$
b = 1,\dots,B.
$$

A simple percentile confidence interval is:

$$
\left[
Q_{0.025}\left(\operatorname{AUC}^{*}\right),
Q_{0.975}\left(\operatorname{AUC}^{*}\right)
\right].
$$

If a bootstrap sample contains only one class, it should be skipped because AUC is not defined.

A candidate metric is more promising if its $\operatorname{AUC}$ is high and the confidence interval is not extremely wide.
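The resampling steps above can be sketched as follows, assuming NumPy. The names `bootstrap_auc_ci` and `empirical_auc`, the default of 2000 resamples, and the toy data are illustrative assumptions, not part of the original procedure.

```python
import numpy as np

def empirical_auc(z, y):
    """Pairwise AUC with ties counted as 1/2."""
    z_ae, z_nc = z[y == 1], z[y == 0]
    diff = z_ae[:, None] - z_nc[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def bootstrap_auc_ci(z, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling patients with replacement."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=int)
    rng = np.random.default_rng(seed)
    n = z.size
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # step 1: sample patients
        yb = y[idx]
        if yb.min() == yb.max():              # only one class: AUC undefined, skip
            continue
        aucs.append(empirical_auc(z[idx], yb))  # step 2: AUC in this resample
    aucs = np.asarray(aucs)
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])  # step 4: quantiles
    return float(lo), float(hi), int(aucs.size)

# Toy values, invented purely for illustration (not real patient data).
z = [2.1, 2.4, 3.7, 2.8, 3.0, 2.5, 3.6, 2.9, 4.2, 2.0, 3.1, 2.6]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
lo, hi, n_used = bootstrap_auc_ci(z, y)
print(f"95% CI: [{lo:.2f}, {hi:.2f}] from {n_used} valid resamples")
```

Returning the number of valid resamples shows how often one-class samples were skipped; if that happens frequently, a stratified bootstrap that resamples within each group separately is a common alternative.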

---

## 5. Logistic regression model for one metric

A simple probabilistic model is:

$$
P(Y_i = 1 \mid z_i)
=
\operatorname{sigmoid}(\beta_0 + \beta_1 z_i),
$$

where

$$
\operatorname{sigmoid}(u)
=
\frac{1}{1+\exp(-u)}.
$$

Equivalently,

$$
\operatorname{logit} P(Y_i = 1 \mid z_i)
=
\beta_0 + \beta_1 z_i.
$$

Before fitting, it is usually useful to standardize the metric:

$$
\tilde{z}_i
=
\frac{z_i - \bar{z}}{s_z}.
$$

Then the model becomes:

$$
P(Y_i = 1 \mid \tilde{z}_i)
=
\operatorname{sigmoid}(\beta_0 + \beta_1 \tilde{z}_i).
$$

The sign of $\beta_1$ gives the direction of association:

$$
\beta_1 > 0
\quad \Rightarrow \quad
\text{larger metric values increase AE risk},
$$

$$
\beta_1 < 0
\quad \Rightarrow \quad
\text{larger metric values decrease AE risk}.
$$

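With a small AE count, unpenalized logistic regression can be unstable: if the metric separates the two groups perfectly or almost perfectly, the maximum-likelihood estimate of $\beta_1$ diverges. Penalized logistic regression (for example ridge or Firth) is often safer.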
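A minimal sketch of a penalized fit, assuming scikit-learn. Its `LogisticRegression` applies an L2 (ridge) penalty by default, with smaller `C` meaning stronger shrinkage; the wrapper name `fit_penalized_logistic`, the choice `C=1.0`, and the toy values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_penalized_logistic(z, y, C=1.0):
    """Standardize z, then fit an L2-penalized one-variable logistic model."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=int)
    z_std = (z - z.mean()) / z.std(ddof=1)   # z-tilde from the text
    model = LogisticRegression(C=C)          # default penalty is L2
    model.fit(z_std.reshape(-1, 1), y)
    beta0 = float(model.intercept_[0])
    beta1 = float(model.coef_[0, 0])
    return beta0, beta1

# Toy values with perfect separation, invented purely for illustration.
z = [2.1, 2.4, 2.2, 2.8, 3.0, 2.5, 3.6, 3.9, 4.2]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
beta0, beta1 = fit_penalized_logistic(z, y)
print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}")
```

The toy groups are perfectly separated, a case in which an unpenalized fit would not have a finite solution; the penalty keeps $\beta_1$ finite. Because the metric is standardized, $\beta_1$ is the change in log-odds per one standard deviation of the metric.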

---

## 6. Compare a new metric against baseline SUV95

A new metric is interesting only if it adds information beyond a baseline such as $SUV95$.

Let

$$
z_i^{(0)} = SUV95_i
$$

be the baseline metric, and let

$$
z_i^{(1)}
$$

be a new candidate metric, for example local contrast or component entropy.

First check correlation:

$$
\rho_S
=
\operatorname{corr}_{\mathrm{Spearman}}
\left(
z^{(0)}, z^{(1)}
\right).
$$

If

$$
|\rho_S| \approx 1,
$$

then the new metric may mostly duplicate SUV95.

Then compare logistic models:

Baseline model:

$$
\operatorname{logit} P(Y_i=1)
=
\beta_0 + \beta_1 z_i^{(0)}.
$$

Extended model:

$$
\operatorname{logit} P(Y_i=1)
=
\beta_0 + \beta_1 z_i^{(0)} + \beta_2 z_i^{(1)}.
$$

The new metric is promising if the extended model improves out-of-sample prediction over the baseline model and the sign and rough magnitude of $\beta_2$ are stable under resampling.

However, with a very small AE count, two-variable models can be unreliable, so this comparison should be treated as exploratory.
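The correlation check and the resampling stability of $\beta_2$ can be sketched together, assuming SciPy and scikit-learn. The helper names `standardize` and `beta2_bootstrap`, the L2 penalty with `C=1.0`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression

def standardize(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def beta2_bootstrap(z0, z1, y, n_boot=500, C=1.0, seed=0):
    """Bootstrap distribution of beta_2 in the extended (two-metric) model."""
    X = np.column_stack([standardize(z0), standardize(z1)])
    y = np.asarray(y, dtype=int)
    rng = np.random.default_rng(seed)
    betas = []
    for _ in range(n_boot):
        idx = rng.integers(0, y.size, size=y.size)
        if y[idx].min() == y[idx].max():   # one-class resample: cannot fit
            continue
        model = LogisticRegression(C=C).fit(X[idx], y[idx])
        betas.append(model.coef_[0, 1])    # coefficient of the new metric
    return np.asarray(betas)

# Simulated data, invented purely for illustration (not real patient data).
rng = np.random.default_rng(1)
n = 40
y = (np.arange(n) < 8).astype(int)                 # 8 simulated AE cases
z0 = rng.normal(size=n) + 0.8 * y                  # stand-in for SUV95
z1 = 0.5 * z0 + rng.normal(size=n) + 0.8 * y       # stand-in for a new metric

rho, _ = spearmanr(z0, z1)
betas = beta2_bootstrap(z0, z1, y, n_boot=200)
lo, hi = np.quantile(betas, [0.025, 0.975])
print(f"Spearman rho = {rho:.2f}")
print(f"beta_2 bootstrap 95% interval: [{lo:.2f}, {hi:.2f}]")
print(f"share of resamples with beta_2 > 0: {np.mean(betas > 0):.2f}")
```

An interval for $\beta_2$ that straddles zero, or a sign that flips in a large share of resamples, suggests the new metric adds little stable information beyond the baseline.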

---

## 7. Cross-validated AUC

To estimate out-of-sample discrimination, use cross-validation.

For each fold:

1. fit the model on training data;
2. predict AE probabilities on held-out data;
3. compute AUC on held-out predictions.

The final estimate is:

$$
\operatorname{AUC}_{CV}
=
\frac{1}{K}
\sum_{k=1}^K
\operatorname{AUC}_k.
$$

With stratified folds, each held-out fold needs at least one AE case for its AUC to be defined, so the number of folds must not exceed the number of AE cases:

$$
K \le n_{\mathrm{AE}}.
$$

For example, if

$$
n_{\mathrm{AE}} = 5,
$$

then at most 5-fold stratified cross-validation is possible.
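A minimal sketch of stratified cross-validated AUC, assuming scikit-learn. The function name `cv_auc`, the L2-penalized model, and the simulated data are illustrative assumptions. Standardization is placed inside the pipeline so that it is re-estimated on each training fold and the held-out fold does not leak into the fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_auc(X, y, n_splits=5, C=1.0, seed=0):
    """Mean held-out AUC over stratified folds, plus the per-fold values."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    n_ae = int(y.sum())
    n_nc = int(y.size - n_ae)
    if n_splits > min(n_ae, n_nc):
        raise ValueError("n_splits must not exceed the smaller class size")
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_aucs = []
    for train_idx, test_idx in skf.split(X, y):
        model = make_pipeline(StandardScaler(), LogisticRegression(C=C))
        model.fit(X[train_idx], y[train_idx])             # 1. fit on training data
        prob = model.predict_proba(X[test_idx])[:, 1]     # 2. predict held-out AE probability
        fold_aucs.append(roc_auc_score(y[test_idx], prob))  # 3. held-out AUC
    return float(np.mean(fold_aucs)), fold_aucs

# Simulated data, invented purely for illustration (not real patient data).
rng = np.random.default_rng(0)
y = np.array([1] * 5 + [0] * 20)
z = rng.normal(size=y.size) + 1.0 * y
mean_auc, fold_aucs = cv_auc(z, y, n_splits=5)
print(f"AUC_CV = {mean_auc:.2f}; per-fold: {np.round(fold_aucs, 2)}")
```

With five AE cases and five folds, each held-out fold contains a single AE case, so the per-fold AUC values are very coarse and their average should be read as a rough estimate.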

---

## 8. Multiple testing caution

If many metrics are tested,

$$
z^{(1)}, z^{(2)}, \dots, z^{(m)},
$$

then some may appear promising by chance.

For exploratory analysis, report results transparently and avoid strong claims.

For confirmatory analysis, use correction methods such as Bonferroni:

$$
\alpha_{\mathrm{Bonferroni}}
=
\frac{\alpha}{m},
$$

or false discovery rate control.

In small datasets, it is better to test a small number of biologically motivated metrics than a large radiomics feature set.
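Both corrections are short to implement. The sketch below assumes NumPy and uses the Benjamini-Hochberg step-up procedure as one common way to control the false discovery rate; the function names and the example $p$-values are hypothetical.

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0 for metric j if p_j <= alpha / m."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling FDR at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * k / m for rank k
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))      # largest rank passing its threshold
        reject[order[: k + 1]] = True              # reject all hypotheses up to that rank
    return reject

# Made-up p-values for five candidate metrics, for illustration only.
pvals = [0.2, 0.001, 0.039, 0.012, 0.029]
print(bonferroni_reject(pvals))          # only 0.001 passes alpha/m = 0.01
print(benjamini_hochberg_reject(pvals))  # all but 0.2 are rejected
```

The example shows the usual pattern: Bonferroni is the more conservative of the two, which matters when only a few AE cases are available.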

---

## 9. Recommended reporting table

For each candidate metric, report:

$$
\operatorname{median}(z \mid AE),
\qquad
\operatorname{median}(z \mid NC),
$$

$$
p_{\mathrm{MWU}},
$$

$$
\operatorname{AUC},
$$

$$
95\% \ \operatorname{CI}_{AUC}.
$$

A useful table structure is:

| Metric | AE median | NC median | MWU p-value | AUC | AUC 95% CI | Direction |
|---|---:|---:|---:|---:|---:|---|
| SUV95 | | | | | | |
| TailExcess95 | | | | | | |
| ComponentEntropy95 | | | | | | |
| LargestComponentFraction95 | | | | | | |
| LocalContrast95 | | | | | | |

---

## 10. Practical interpretation

A useful biomarker candidate should satisfy several criteria:

1. It visually separates AE and non-AE cases.
2. Its oriented AUC is clearly above 0.5.
3. Its bootstrap AUC confidence interval is not extremely wide.
4. It is robust across nearby thresholds, for example $Q_{90}$, $Q_{95}$, and $Q_{97.5}$.
5. It is not merely a duplicate of $SUV95$.
6. It has a plausible biological interpretation.
7. It remains stable when influential patients are removed.
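Criterion 7 can be checked with a simple leave-one-patient-out analysis: recompute the AUC with each patient removed in turn and look at how far it moves. The sketch below assumes NumPy; the function names and toy values are hypothetical.

```python
import numpy as np

def empirical_auc(z, y):
    """Pairwise AUC with ties counted as 1/2."""
    z_ae, z_nc = z[y == 1], z[y == 0]
    diff = z_ae[:, None] - z_nc[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def leave_one_out_auc(z, y):
    """Full-sample AUC and the AUC obtained after dropping each patient in turn."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=int)
    full = empirical_auc(z, y)
    loo = np.full(z.size, np.nan)
    for i in range(z.size):
        keep = np.arange(z.size) != i
        yk = y[keep]
        if yk.min() == yk.max():   # removing this patient leaves one class only
            continue               # AUC undefined, keep NaN
        loo[i] = empirical_auc(z[keep], yk)
    return full, loo

# Toy values, invented purely for illustration (not real patient data).
z = [1.0, 2.0, 3.0, 2.0, 4.0]
y = [0, 0, 0, 1, 1]
full, loo = leave_one_out_auc(z, y)
print(f"full AUC = {full:.3f}")
print("AUC without each patient:", np.round(loo, 3))
print(f"largest change = {np.nanmax(np.abs(loo - full)):.3f}")
```

A large change from removing a single patient indicates that the apparent discrimination depends on that one case.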

For spatial SUV metrics, especially promising candidates are:

$$
SUV95,
$$

$$
\operatorname{TailExcess}_{95},
$$

$$
\operatorname{LargestComponentFraction}_{95},
$$

$$
\operatorname{ComponentEntropy}_{95},
$$

$$
\operatorname{TailSpread}_{95},
$$

$$
\operatorname{LocalContrast}_{95}.
$$

The best metric is not necessarily the one with the smallest $p$-value. It should be interpretable, robust, and clinically plausible.