Suppose each patient/image has a binary outcome
$$ Y_i = \begin{cases} 1, & \text{adverse event (AE)}, \\ 0, & \text{non-AE / control}, \end{cases} $$
and a candidate imaging metric
$$ z_i \in \mathbb{R}. $$
Examples of candidate metrics are:
$$ z_i = SUV95_i, $$
$$ z_i = \operatorname{TailExcess}_{95,i}, $$
$$ z_i = \operatorname{ComponentEntropy}_{95,i}, $$
$$ z_i = \operatorname{LocalContrast}_{95,i}. $$
The goal is to test whether $z_i$ contains useful information for distinguishing AE from non-AE cases.
First compare the empirical distributions:
$$ \{z_i : Y_i = 1\} \qquad \text{and} \qquad \{z_i : Y_i = 0\}. $$
Useful summaries are:
$$ \operatorname{median}(z \mid Y=1), \qquad \operatorname{median}(z \mid Y=0), $$
$$ \operatorname{IQR}(z \mid Y=1), \qquad \operatorname{IQR}(z \mid Y=0). $$
A simple effect-size summary is the difference in medians:
$$ \operatorname{median}(z \mid Y=1) - \operatorname{median}(z \mid Y=0). $$
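The group summaries above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `summarize`, the choice of the "inclusive" quantile method, and the data values are assumptions made for the example, not part of the original analysis.

```python
from statistics import median, quantiles

def summarize(z, y):
    """Median and IQR of z per outcome group, plus the difference in medians."""
    out = {}
    for label in (1, 0):
        vals = [zi for zi, yi in zip(z, y) if yi == label]
        # quantiles() needs at least two values per group.
        q1, _, q3 = quantiles(vals, n=4, method="inclusive")
        out[label] = {"n": len(vals), "median": median(vals), "iqr": q3 - q1}
    # Effect size: median(z | Y=1) - median(z | Y=0)
    out["median_diff"] = out[1]["median"] - out[0]["median"]
    return out

# Made-up values for illustration only.
z = [4.0, 5.0, 6.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 1, 1, 0, 0, 0, 0, 0]
summary = summarize(z, y)
print(summary)
```

Different quantile conventions give slightly different IQRs in small groups, so it is worth stating which one was used when reporting.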
Because the AE group is often small, visualization is essential. Use a strip plot, box plot, or violin plot, and show each patient as an individual point.
A first screening test is the Mann--Whitney U test.
The null hypothesis is:
$$ H_0: z \mid Y=1 \quad \text{and} \quad z \mid Y=0 \quad \text{come from the same distribution}. $$
The alternative hypothesis is:
$$ H_1: z \mid Y=1 \quad \text{and} \quad z \mid Y=0 \quad \text{differ in distribution}. $$
This test is useful for screening, but with small datasets the $p$-value should not be overinterpreted.
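As a rough sketch of this screening step, the U statistic can be computed by counting pairs, and a p-value obtained by permuting the outcome labels. Note the swap: this is a Monte Carlo permutation version of the test, not the tabulated or asymptotic Mann--Whitney p-value that statistical libraries typically report. The function names and the data are hypothetical.

```python
import random

def mann_whitney_u(ae, nc):
    """U statistic: AE/non-AE pairs with ae > nc count 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in ae for b in nc)

def permutation_p_value(z, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the U statistic."""
    rng = random.Random(seed)
    n1 = sum(y)
    centre = n1 * (len(y) - n1) / 2.0  # expected U under H0

    def u_for(labels):
        ae = [zi for zi, yi in zip(z, labels) if yi == 1]
        nc = [zi for zi, yi in zip(z, labels) if yi == 0]
        return mann_whitney_u(ae, nc)

    observed = abs(u_for(y) - centre)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between z and the outcome
        if abs(u_for(labels) - centre) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up values for illustration only.
z = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 1, 1, 0, 0, 0, 0]
u_obs = mann_whitney_u([zi for zi, yi in zip(z, y) if yi == 1],
                       [zi for zi, yi in zip(z, y) if yi == 0])
p_perm = permutation_p_value(z, y)
```

Even with perfect separation, three AE cases against four controls cannot produce a very small two-sided p-value, which illustrates why the p-value should be read cautiously in small datasets.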
Important: for biomarker evaluation, ROC AUC is often more relevant than a group-difference test alone, because it measures how well the metric separates AE from non-AE cases.
Given a threshold $c$, classify a case as AE if
$$ z_i \ge c. $$
Then define:
$$ \operatorname{TPR}(c) = P(z \ge c \mid Y=1), $$
$$ \operatorname{FPR}(c) = P(z \ge c \mid Y=0). $$
The ROC curve is
$$ \operatorname{TPR}(c) \quad \text{versus} \quad \operatorname{FPR}(c), $$
as the threshold $c$ varies.
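The empirical ROC curve can be traced by sweeping the threshold over the observed metric values. The sketch below is one simple way to do this; `roc_points` and the data are illustrative assumptions.

```python
def roc_points(z, y):
    """(FPR, TPR) pairs for the rule 'call AE if z >= c', one per distinct threshold."""
    n_pos = sum(1 for yi in y if yi == 1)
    n_neg = len(y) - n_pos
    points = [(0.0, 0.0)]  # threshold above every value: nothing is called AE
    for c in sorted(set(z), reverse=True):
        tp = sum(1 for zi, yi in zip(z, y) if zi >= c and yi == 1)
        fp = sum(1 for zi, yi in zip(z, y) if zi >= c and yi == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Made-up values for illustration only.
z = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y = [1, 1, 1, 0, 0, 0]
pts = roc_points(z, y)
for fpr, tpr in pts:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```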
The AUC is:
$$ \operatorname{AUC} = P(z_{\mathrm{AE}} > z_{\mathrm{NC}}), $$
where $z_{\mathrm{AE}}$ is a randomly selected AE value and $z_{\mathrm{NC}}$ is a randomly selected non-AE value.
Interpretation:
$$ \operatorname{AUC} = 0.5 \quad \Rightarrow \quad \text{no discrimination}, $$
$$ \operatorname{AUC} > 0.5 \quad \Rightarrow \quad \text{larger metric values tend to indicate AE}, $$
$$ \operatorname{AUC} < 0.5 \quad \Rightarrow \quad \text{smaller metric values tend to indicate AE}. $$
For screening, it is useful to define an oriented AUC:
$$ \operatorname{AUC}_{\mathrm{oriented}} = \max(\operatorname{AUC}, 1-\operatorname{AUC}). $$
This measures discrimination strength independent of direction.
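The pairwise definition of the AUC translates directly into code. The following sketch counts AE/non-AE pairs and gives ties half credit, which is the usual convention but an assumption beyond the strict inequality in the definition; the function names and data are illustrative.

```python
def auc(z, y):
    """Empirical P(z_AE > z_NC) over all AE/non-AE pairs; ties count 0.5."""
    ae = [zi for zi, yi in zip(z, y) if yi == 1]
    nc = [zi for zi, yi in zip(z, y) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in ae for b in nc)
    return wins / (len(ae) * len(nc))

def oriented_auc(z, y):
    """Discrimination strength regardless of direction: max(AUC, 1 - AUC)."""
    a = auc(z, y)
    return max(a, 1.0 - a)

# Made-up values for illustration only.
z = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y = [1, 1, 1, 0, 0, 0]
auc_value = auc(z, y)                                   # 8 of 9 pairs are ordered correctly
auc_flipped = auc([-v for v in z], y)                   # same metric with reversed sign
auc_oriented_flipped = oriented_auc([-v for v in z], y)
```

Reversing the sign of the metric turns an AUC of 8/9 into 1/9, while the oriented AUC stays at 8/9.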
Because the AE group is usually small, the AUC estimate can be unstable.
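Use bootstrap resampling to quantify this uncertainty: resample the patients with replacement $B$ times and recompute the AUC on each resample.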
Let
$$ \operatorname{AUC}^{*(b)} $$
be the bootstrap AUC from bootstrap sample $b$, where
$$ b = 1,\dots,B. $$
A simple percentile confidence interval is:
$$ \left[ Q_{0.025}\left(\operatorname{AUC}^{*}\right), Q_{0.975}\left(\operatorname{AUC}^{*}\right) \right]. $$
If a bootstrap sample contains only one class, it should be skipped because AUC is not defined.
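A minimal sketch of the percentile bootstrap, including the rule of skipping single-class resamples, might look as follows. The function names, the nearest-rank percentile rule, the default of 2000 resamples, and the data are all assumptions made for illustration.

```python
import random

def auc(z, y):
    """Empirical P(z_AE > z_NC); ties count 0.5."""
    ae = [zi for zi, yi in zip(z, y) if yi == 1]
    nc = [zi for zi, yi in zip(z, y) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in ae for b in nc)
    return wins / (len(ae) * len(nc))

def bootstrap_auc_ci(z, y, n_boot=2000, seed=0):
    """Percentile 95% CI for the AUC; returns (low, high, resamples actually used)."""
    rng = random.Random(seed)
    n = len(z)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # only one class present: AUC is undefined, skip
        stats.append(auc([z[i] for i in idx], yb))
    stats.sort()

    def pct(q):
        # Nearest-rank percentile on the sorted bootstrap values.
        return stats[round(q * (len(stats) - 1))]

    return pct(0.025), pct(0.975), len(stats)

# Made-up values for illustration only.
z = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y = [1, 1, 1, 0, 0, 0]
ci_low, ci_high, n_used = bootstrap_auc_ci(z, y)
print(ci_low, ci_high, n_used)
```

Reporting the number of resamples actually used makes it visible how often the single-class case occurred.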
A candidate metric is more promising if its $\operatorname{AUC}$ is high and the confidence interval is not extremely wide.
A simple probabilistic model is:
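$$ P(Y_i = 1 \mid z_i) = \operatorname{sigmoid}(\beta_0 + \beta_1 z_i), $$
where
$$ \operatorname{sigmoid}(u) = \frac{1}{1+\exp(-u)}. $$
Equivalently,
$$ \log \frac{P(Y_i = 1 \mid z_i)}{1 - P(Y_i = 1 \mid z_i)} = \beta_0 + \beta_1 z_i. $$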
Before fitting, it is usually useful to standardize the metric:
$$ \tilde{z}_i = \frac{z_i - \bar{z}}{s_z}. $$
Then the model becomes:
$$ P(Y_i = 1 \mid \tilde{z}_i) = \operatorname{sigmoid}(\beta_0 + \beta_1 \tilde{z}_i). $$
The sign of $\beta_1$ gives the direction of association:
$$ \beta_1 > 0 \quad \Rightarrow \quad \text{larger metric values increase AE risk}, $$
$$ \beta_1 < 0 \quad \Rightarrow \quad \text{larger metric values decrease AE risk}. $$
With a small AE count, standard (unpenalized) logistic regression may be unstable: if the metric separates the groups perfectly, the estimate of $\beta_1$ diverges. Penalized logistic regression is often safer.
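To make the model concrete, here is a small sketch of a ridge-penalized logistic fit on the standardized metric, written with plain gradient descent so that it needs no external library. The penalty strength `lam`, the learning rate, the iteration count, and the data are illustrative assumptions; a real analysis would more likely use an established statistics package.

```python
import math
from statistics import mean, stdev

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def fit_ridge_logistic(z, y, lam=1.0, lr=0.1, n_iter=5000):
    """Fit P(Y=1) = sigmoid(b0 + b1 * z_tilde) with an L2 penalty on b1 only."""
    z_bar, s_z = mean(z), stdev(z)
    zt = [(zi - z_bar) / s_z for zi in z]  # standardized metric
    n = len(z)
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [sigmoid(b0 + b1 * t) for t in zt]
        # Gradient of the mean log-loss plus (lam / 2n) * b1^2
        g0 = sum(pi - yi for pi, yi in zip(p, y)) / n
        g1 = (sum((pi - yi) * t for pi, yi, t in zip(p, y, zt)) + lam * b1) / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Made-up values for illustration only.
z = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y = [1, 1, 1, 0, 0, 0]
b0, b1 = fit_ridge_logistic(z, y)
print(f"beta_0 = {b0:.3f}, beta_1 = {b1:.3f}")
```

Because the metric is standardized, $\beta_1$ is the change in log-odds per one standard deviation of the metric, and the penalty keeps it finite even when the groups are perfectly separated.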
A new metric is interesting only if it adds information beyond a baseline such as $SUV95$.
Let
$$ z_i^{(0)} = SUV95_i $$
be the baseline metric, and let
$$ z_i^{(1)} $$
be a new candidate metric, for example local contrast or component entropy.
First check correlation:
$$ \rho_S = \operatorname{corr}_{\mathrm{Spearman}} \left( z^{(0)}, z^{(1)} \right). $$
If
$$ |\rho_S| \approx 1, $$
then the new metric may mostly duplicate SUV95.
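The redundancy check can be sketched as the Pearson correlation of the rank-transformed metrics, which is the definition of the Spearman coefficient. The helper names and the data are hypothetical.

```python
import math

def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the midranks."""
    return pearson(midranks(a), midranks(b))

# Made-up values: the 'new' metric is a monotone transform of the baseline.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
duplicate = [v ** 3 for v in baseline]
rho_dup = spearman(baseline, duplicate)
rho_rev = spearman(baseline, [-v for v in baseline])
```

A monotone transform of the baseline gives $|\rho_S| = 1$ even though the raw values differ, which is exactly the duplication case to watch for.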
Then compare logistic models:
Baseline model:
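$$ \operatorname{logit} P(Y_i = 1 \mid z_i^{(0)}) = \beta_0 + \beta_1 z_i^{(0)}. $$
Extended model:
$$ \operatorname{logit} P(Y_i = 1 \mid z_i^{(0)}, z_i^{(1)}) = \beta_0 + \beta_1 z_i^{(0)} + \beta_2 z_i^{(1)}. $$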
The new metric is promising if the extended model improves prediction and $\beta_2$ is stable under resampling.
However, with a very small AE count, two-variable models can be unreliable, so this comparison should be treated as exploratory.
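One way to sketch the baseline-versus-extended comparison is to fit both models with the same ridge penalty and compare their mean log-loss. Everything here is an assumption for illustration: the function names, the penalty and step-size settings, the made-up data, and the use of in-sample log-loss, which favours the larger model and is only a first look.

```python
import math
from statistics import mean, stdev

def standardize(col):
    m, s = mean(col), stdev(col)
    return [(v - m) / s for v in col]

def predict(b0, beta, row):
    return 1.0 / (1.0 + math.exp(-(b0 + sum(b * x for b, x in zip(beta, row)))))

def fit_ridge_logistic(columns, y, lam=1.0, lr=0.1, n_iter=5000):
    """Gradient descent on mean log-loss + (lam / 2n) * sum(beta_j^2)."""
    X = list(zip(*(standardize(c) for c in columns)))  # rows of standardized metrics
    n = len(y)
    b0, beta = 0.0, [0.0] * len(columns)
    for _ in range(n_iter):
        resid = [predict(b0, beta, row) - yi for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) + lam * beta[j]
                for j in range(len(beta))]
        b0 -= lr * sum(resid) / n
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return b0, beta, X

def mean_log_loss(b0, beta, X, y):
    return -sum(yi * math.log(predict(b0, beta, row))
                + (1 - yi) * math.log(1.0 - predict(b0, beta, row))
                for row, yi in zip(X, y)) / len(y)

# Made-up values for illustration only.
suv95 = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
new_metric = [0.5, 0.7, 0.9, 0.2, 0.4, 0.1]
y = [1, 1, 1, 0, 0, 0]
b0_a, beta_a, X_a = fit_ridge_logistic([suv95], y)
b0_b, beta_b, X_b = fit_ridge_logistic([suv95, new_metric], y)
loss_base = mean_log_loss(b0_a, beta_a, X_a, y)
loss_ext = mean_log_loss(b0_b, beta_b, X_b, y)
print(f"baseline loss = {loss_base:.3f}, extended loss = {loss_ext:.3f}, beta_2 = {beta_b[1]:.3f}")
```

Repeating the extended fit on bootstrap resamples and checking whether $\beta_2$ keeps its sign is one way to probe the stability mentioned above.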
To estimate out-of-sample discrimination, use cross-validation.
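For each fold $k = 1,\dots,K$, fit the model on the remaining $K-1$ folds and compute $\operatorname{AUC}_k$ on the held-out fold.
The final estimate is:
$$ \operatorname{AUC}_{\mathrm{CV}} = \frac{1}{K} \sum_{k=1}^K \operatorname{AUC}_k. $$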
For small AE count, the number of folds must not exceed the number of AE cases:
$$ K \le n_{\mathrm{AE}}. $$
For example, if
$$ n_{\mathrm{AE}} = 5, $$
then at most 5-fold stratified cross-validation is possible.
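The fold constraint and the averaging step can be sketched as below. Two simplifications are assumed and should be read as such: folds are assigned round-robin in data order (shuffling first would be usual), and "fitting" a single-metric model is reduced to learning the direction of the metric on the training folds. The sketch also requires $K$ not to exceed the number of non-AE cases, so that every fold contains both classes. All names and data are hypothetical.

```python
def auc(z, y):
    """Empirical P(z_AE > z_NC); ties count 0.5."""
    ae = [zi for zi, yi in zip(z, y) if yi == 1]
    nc = [zi for zi, yi in zip(z, y) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in ae for b in nc)
    return wins / (len(ae) * len(nc))

def stratified_folds(y, k):
    """Split indices into k folds, spreading each class round-robin."""
    n_ae = sum(1 for yi in y if yi == 1)
    n_nc = len(y) - n_ae
    if k < 2 or k > min(n_ae, n_nc):
        raise ValueError("need 2 <= K <= number of cases in the smaller class")
    folds = [[] for _ in range(k)]
    for label in (1, 0):
        members = [i for i, yi in enumerate(y) if yi == label]
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def cv_auc(z, y, k):
    """Mean held-out AUC, with the metric's direction chosen on training folds only."""
    fold_aucs = []
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train = [i for i in range(len(y)) if i not in held_out]
        train_auc = auc([z[i] for i in train], [y[i] for i in train])
        sign = 1.0 if train_auc >= 0.5 else -1.0
        fold_aucs.append(auc([sign * z[i] for i in test_idx],
                             [y[i] for i in test_idx]))
    return sum(fold_aucs) / k

# Made-up values for illustration only.
z = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y = [1, 1, 1, 0, 0, 0]
auc_cv = cv_auc(z, y, k=3)
```

With one AE case per fold, each fold AUC rests on a handful of pairs, so the average is very coarse; this is the practical cost of a small AE count.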
If many metrics are tested,
$$ z^{(1)}, z^{(2)}, \dots, z^{(m)}, $$
then some may appear promising by chance.
For exploratory analysis, report results transparently and avoid strong claims.
For confirmatory analysis, use correction methods such as Bonferroni:
$$ \alpha_{\mathrm{adj}} = \frac{\alpha}{m}, $$
or false discovery rate control.
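Both corrections are short enough to sketch directly. For false discovery rate control, the example uses the Benjamini--Hochberg step-up procedure, which is one common choice and an assumption here since the text does not name a specific method. The p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for metric j if p_j <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR control: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

# Made-up p-values for m = 5 candidate metrics.
p_values = [0.001, 0.012, 0.025, 0.035, 0.2]
keep_bonf = bonferroni(p_values)
keep_bh = benjamini_hochberg(p_values)
```

On these values Bonferroni keeps only the first metric, while Benjamini--Hochberg keeps the first four, showing how much more conservative the former is.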
In small datasets, it is better to test a small number of biologically motivated metrics than a large radiomics feature set.
For each candidate metric, report:
$$ \operatorname{median}(z \mid \mathrm{AE}), \qquad \operatorname{median}(z \mid \mathrm{NC}), $$
$$ p_{\mathrm{MWU}}, $$
$$ \operatorname{AUC}, $$
$$ 95\% \ \operatorname{CI}_{\mathrm{AUC}}. $$
A useful table structure is:
| Metric | AE median | NC median | MWU p-value | AUC | AUC 95% CI | Direction |
|---|---|---|---|---|---|---|
| SUV95 | | | | | | |
| TailExcess95 | | | | | | |
| ComponentEntropy95 | | | | | | |
| LargestComponentFraction95 | | | | | | |
| TailLocalContrast95 | | | | | | |
A useful biomarker candidate should satisfy several criteria:
For spatial SUV metrics, especially promising candidates are:
$$ SUV95, $$
$$ \operatorname{TailExcess}_{95}, $$
$$ \operatorname{LargestComponentFraction}_{95}, $$
$$ \operatorname{ComponentEntropy}_{95}, $$
$$ \operatorname{TailSpread}_{95}, $$
$$ \operatorname{LocalContrast}_{95}. $$
The best metric is not necessarily the one with the smallest $p$-value. It should be interpretable, robust, and clinically plausible.