Testing a Candidate SUV Spatial Metric as a Biomarker

Ref:

ChatGPT 5.4

Suppose each patient/image has a binary outcome

$$ Y_i = \begin{cases} 1, & \text{adverse event (AE)}, \\ 0, & \text{non-AE / control}, \end{cases} $$

and a candidate imaging metric

$$ z_i \in \mathbb{R}. $$

Examples of candidate metrics are:

$$ z_i = SUV95_i, $$

$$ z_i = \operatorname{TailExcess}_{95,i}, $$

$$ z_i = \operatorname{ComponentEntropy}_{95,i}, $$

$$ z_i = \operatorname{LocalContrast}_{95,i}. $$

The goal is to test whether $z_i$ contains useful information for distinguishing AE from non-AE cases.

1. Descriptive comparison between AE and non-AE groups

First compare the empirical distributions:

$$ \{z_i : Y_i = 1\} \qquad \text{and} \qquad \{z_i : Y_i = 0\}. $$

Useful summaries are:

$$ \operatorname{median}(z \mid Y=1), \qquad \operatorname{median}(z \mid Y=0), $$

$$ \operatorname{IQR}(z \mid Y=1), \qquad \operatorname{IQR}(z \mid Y=0). $$

A simple effect-size summary is the difference in medians:

$$ \Delta_{\mathrm{med}} = \operatorname{median}(z \mid Y=1) - \operatorname{median}(z \mid Y=0). $$

Because the AE group is often small, visualization is essential. Use a strip plot, box plot, or violin plot, with each patient shown as an individual point.
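The summaries above can be computed with the Python standard library alone. This is a minimal sketch: the function name `describe_groups` and the 0/1 label coding are illustrative assumptions, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from other statistical software.

```python
import statistics

def describe_groups(z, y):
    """Per-group median and IQR of the metric, plus the AE-minus-non-AE median difference."""
    ae = [v for v, lab in zip(z, y) if lab == 1]
    nc = [v for v, lab in zip(z, y) if lab == 0]

    def iqr(values):
        # statistics.quantiles(n=4) returns the three quartile cut points;
        # its default "exclusive" method needs at least two values per group
        q1, _, q3 = statistics.quantiles(values, n=4)
        return q3 - q1

    med_ae, med_nc = statistics.median(ae), statistics.median(nc)
    return {
        "n_AE": len(ae), "n_NC": len(nc),
        "median_AE": med_ae, "median_NC": med_nc,
        "IQR_AE": iqr(ae), "IQR_NC": iqr(nc),
        "delta_med": med_ae - med_nc,
    }
```

With made-up values `describe_groups([5.1, 6.3, 7.0, 3.2, 3.9, 4.4, 4.0, 3.5], [1, 1, 1, 0, 0, 0, 0, 0])`, the group medians are 6.3 and 3.9, so $\Delta_{\mathrm{med}} \approx 2.4$.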

2. Nonparametric group-difference test

A first screening test is the Mann-Whitney U test.

The null hypothesis is:

$$ H_0: z \mid Y=1 \quad \text{and} \quad z \mid Y=0 \quad \text{come from the same distribution}. $$

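The two-sided alternative hypothesis is that values in one group tend to be larger than in the other:

$$ H_1: P(z_{\mathrm{AE}} > z_{\mathrm{NC}}) \ne P(z_{\mathrm{AE}} < z_{\mathrm{NC}}), $$

where $z_{\mathrm{AE}}$ and $z_{\mathrm{NC}}$ are randomly selected AE and non-AE values. The test has little power against distributions that differ only in spread.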

This test is useful for screening, but with small datasets the $p$-value should not be overinterpreted.

Important:

A small $p$-value does not prove clinical usefulness.
A large $p$-value does not prove the metric is useless.
With a very small AE count, the test has low statistical power.
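For very small samples, an exact permutation version of the test is practical. The sketch below uses hypothetical helper names: it computes $U$ for the AE group, counting ties as one half, and obtains a two-sided $p$-value by enumerating every way of assigning the AE label to $n_{\mathrm{AE}}$ of the pooled values. In practice `scipy.stats.mannwhitneyu` is the usual library routine; this version only makes the mechanics explicit.

```python
from itertools import combinations
from math import comb

def mann_whitney_u(ae, nc):
    """U for the AE group: each pair with z_AE > z_NC counts 1, each tie counts 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in ae for b in nc)

def mwu_exact_pvalue(ae, nc, max_splits=200_000):
    """Two-sided exact permutation p-value for U (feasible only for small samples)."""
    pooled = list(ae) + list(nc)
    n_ae, n_total = len(ae), len(pooled)
    if comb(n_total, n_ae) > max_splits:
        raise ValueError("too many relabelings for exact enumeration")
    center = n_ae * len(nc) / 2.0            # expected U under H0
    observed = abs(mann_whitney_u(ae, nc) - center)
    extreme = total = 0
    for idx in combinations(range(n_total), n_ae):
        chosen = set(idx)
        g_ae = [pooled[i] for i in idx]
        g_nc = [pooled[i] for i in range(n_total) if i not in chosen]
        total += 1
        # count relabelings at least as far from the null center as the observed U
        if abs(mann_whitney_u(g_ae, g_nc) - center) >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With three AE values that all exceed five non-AE values, only 2 of the $\binom{8}{3} = 56$ relabelings are as extreme, giving $p = 2/56 \approx 0.036$: even perfect separation yields only a modest $p$-value at this sample size.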

3. ROC AUC as a discrimination measure

For biomarker evaluation, ROC AUC is often more informative than a group-difference test alone, because it directly quantifies how well the metric separates AE from non-AE cases.

Given a threshold $c$, classify a case as AE if

$$ z_i \ge c. $$

Then define:

$$ \operatorname{TPR}(c) = P(z \ge c \mid Y=1), $$

$$ \operatorname{FPR}(c) = P(z \ge c \mid Y=0). $$

The ROC curve is

$$ \operatorname{TPR}(c) \quad \text{versus} \quad \operatorname{FPR}(c), $$

as the threshold $c$ varies.
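The empirical ROC curve can be traced by sweeping $c$ over the observed values. This is a minimal sketch, assuming 0/1 labels and the rule $z_i \ge c$ from above; `roc_points` is a hypothetical name.

```python
def roc_points(z, y):
    """Empirical (FPR, TPR) pairs for the rule 'classify as AE if z >= c'.

    Thresholds run over +infinity (so the curve starts at (0, 0)) and then
    every distinct observed value in decreasing order.
    """
    n_ae = sum(1 for lab in y if lab == 1)
    n_nc = sum(1 for lab in y if lab == 0)
    thresholds = [float("inf")] + sorted(set(z), reverse=True)
    points = []
    for c in thresholds:
        tp = sum(1 for v, lab in zip(z, y) if lab == 1 and v >= c)
        fp = sum(1 for v, lab in zip(z, y) if lab == 0 and v >= c)
        points.append((fp / n_nc, tp / n_ae))
    return points
```

For `roc_points([4, 3, 2, 1], [1, 0, 1, 0])` the curve is `[(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]`, whose area is 0.75.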

The AUC is:

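$$ \operatorname{AUC} = P(z_{\mathrm{AE}} > z_{\mathrm{NC}}), $$

where $z_{\mathrm{AE}}$ is a randomly selected AE value and $z_{\mathrm{NC}}$ is a randomly selected non-AE value. If ties are possible, they are conventionally counted as one half, $P(z_{\mathrm{AE}} > z_{\mathrm{NC}}) + \tfrac{1}{2} P(z_{\mathrm{AE}} = z_{\mathrm{NC}})$; with this convention the empirical AUC equals the Mann-Whitney $U$ statistic of the AE group divided by $n_{\mathrm{AE}} n_{\mathrm{NC}}$.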

Interpretation:

$$ \operatorname{AUC} = 0.5 \quad \Rightarrow \quad \text{no discrimination}, $$

$$ \operatorname{AUC} > 0.5 \quad \Rightarrow \quad \text{larger metric values tend to indicate AE}, $$

$$ \operatorname{AUC} < 0.5 \quad \Rightarrow \quad \text{smaller metric values tend to indicate AE}. $$

For screening, it is useful to define an oriented AUC:

$$ \operatorname{AUC}_{\mathrm{oriented}} = \max(\operatorname{AUC}, 1-\operatorname{AUC}). $$

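This measures discrimination strength independent of direction. Because the direction is chosen after seeing the data, the oriented AUC is biased upward (it can never fall below 0.5), so report the direction together with it.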
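The pairwise definition translates directly into code. This is a minimal sketch with hypothetical function names, counting ties as one half; for a single continuous score, `sklearn.metrics.roc_auc_score` gives the same value.

```python
def auc(z, y):
    """Empirical AUC = P(z_AE > z_NC) + 0.5 * P(z_AE == z_NC) over all AE/non-AE pairs."""
    ae = [v for v, lab in zip(z, y) if lab == 1]
    nc = [v for v, lab in zip(z, y) if lab == 0]
    if not ae or not nc:
        raise ValueError("AUC needs at least one AE and one non-AE case")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in ae for b in nc)
    return wins / (len(ae) * len(nc))

def oriented_auc(z, y):
    """Direction-free discrimination strength, plus the direction that achieved it."""
    a = auc(z, y)
    direction = "higher z -> AE" if a >= 0.5 else "lower z -> AE"
    return max(a, 1.0 - a), direction
```

For example, `oriented_auc([1, 2, 3, 4], [1, 0, 1, 0])` returns `(0.75, "lower z -> AE")`, since the raw AUC is 0.25.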

4. Bootstrap confidence interval for AUC

Because the AE group is usually small, the AUC estimate can be unstable.

Use bootstrap resampling:

sample patients with replacement;
compute AUC in each bootstrap sample;
repeat many times;
take empirical quantiles of the bootstrap AUC values.

Let

$$ \operatorname{AUC}^{*(b)} $$

be the bootstrap AUC from bootstrap sample $b$, where

$$ b = 1,\dots,B. $$

A simple percentile confidence interval is:

$$ \left[ Q_{0.025}\left(\operatorname{AUC}^{*}\right), Q_{0.975}\left(\operatorname{AUC}^{*}\right) \right], $$

where $Q_q$ is the empirical $q$-quantile of the $B$ bootstrap AUC values.

If a bootstrap sample contains only one class, it should be skipped because AUC is not defined.
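The resampling steps above can be sketched as follows, assuming `z` and `y` are equal-length lists of floats and 0/1 labels. The function name, `n_boot`, `seed`, and the simple index-based quantile are illustrative choices, not prescribed by the text. Single-class resamples are skipped, and the number of retained samples is returned so it can be reported.

```python
import random

def auc(z, y):
    """Pairwise AUC with ties counted as one half."""
    ae = [v for v, lab in zip(z, y) if lab == 1]
    nc = [v for v, lab in zip(z, y) if lab == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in ae for b in nc)
    return wins / (len(ae) * len(nc))

def bootstrap_auc_ci(z, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for AUC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(z)
    boot = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # only one class drawn: AUC undefined, skip this sample
        boot.append(auc([z[i] for i in idx], yb))
    if not boot:
        raise ValueError("no bootstrap sample contained both classes")
    boot.sort()
    tail = (1.0 - level) / 2.0
    # simple empirical quantiles: index into the sorted bootstrap AUCs
    lo = boot[round(tail * (len(boot) - 1))]
    hi = boot[round((1.0 - tail) * (len(boot) - 1))]
    return lo, hi, len(boot)
```

When every AE value exceeds every non-AE value, every retained resample has AUC 1.0, so the interval collapses to $[1, 1]$. That degenerate interval reflects the small sample rather than certainty about the metric.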

A candidate metric is more promising if its AUC is high and its confidence interval is not extremely wide. An interval that includes 0.5 is compatible with no discrimination.

5. Logistic regression model for one metric

A simple probabilistic model is:

$$ P(Y_i = 1 \mid z_i) = \operatorname{sigmoid}(\beta_0 + \beta_1 z_i), $$

where

$$ \operatorname{sigmoid}(u) = \frac{1}{1+\exp(-u)}. $$

Equivalently,

$$ \operatorname{logit} P(Y_i = 1 \mid z_i) = \beta_0 + \beta_1 z_i. $$

Before fitting, it is usually useful to standardize the metric:

$$ \tilde{z}_i = \frac{z_i - \bar{z}}{s_z}, $$

where $\bar{z}$ and $s_z$ are the sample mean and standard deviation of the metric.

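Then the model becomes:

$$ P(Y_i = 1 \mid \tilde{z}_i) = \operatorname{sigmoid}(\beta_0 + \beta_1 \tilde{z}_i), $$

so $\beta_1$ is the change in the log-odds of AE per one standard deviation of $z$, which makes coefficients comparable across metrics measured on different scales.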

The sign of $\beta_1$ gives the direction of association:

$$ \beta_1 > 0 \quad \Rightarrow \quad \text{larger metric values increase AE risk}, $$

$$ \beta_1 < 0 \quad \Rightarrow \quad \text{larger metric values decrease AE risk}. $$

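With a small AE count, standard (maximum-likelihood) logistic regression may be unstable; if the groups are completely separated, for example every AE value lies above every non-AE value, the estimate of $\beta_1$ diverges. Penalized logistic regression, such as ridge (L2) or Firth's bias-reduced logistic regression, is often safer.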
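A minimal sketch of a penalized fit on the standardized metric, using Newton's method with a ridge (L2) penalty on $\beta_1$ only. The function name and the penalty strength `lam=1.0` are hypothetical choices. In practice a library implementation would normally be used, for example scikit-learn's `LogisticRegression`, which applies an L2 penalty by default.

```python
import math
import statistics

def fit_ridge_logistic(z, y, lam=1.0, n_iter=50):
    """L2-penalized logistic regression of Y on the standardized metric.

    The intercept beta0 is unpenalized. Returns (beta0, beta1) on the
    standardized scale, so beta1 is per one standard deviation of z.
    """
    mean, sd = statistics.mean(z), statistics.stdev(z)
    x = [(v - mean) / sd for v in z]           # z-tilde
    b0 = b1 = 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # gradient of the penalized log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x)) - lam * b1
        # negative Hessian; positive definite whenever lam > 0
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x)) + lam
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * step = gradient
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1
```

Larger `lam` shrinks $\beta_1$ more strongly toward 0. With only a handful of AE cases, the penalty keeps the estimate finite even under complete separation, at the cost of some bias.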

6. Compare a new metric against baseline SUV95

A new metric is interesting only if it adds information beyond a baseline such as $SUV95$.

Let

$$ z_i^{(0)} = SUV95_i $$

be the baseline metric, and let

$$ z_i^{(1)} $$

be a new candidate metric, for example local contrast or component entropy.

First, check the rank correlation between the two metrics:

$$ \rho_S = \operatorname{corr}_{\mathrm{Spearman}} \left( z^{(0)}, z^{(1)} \right). $$

If

$$ |\rho_S| \approx 1, $$

then the new metric may mostly duplicate SUV95.
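Spearman's $\rho_S$ is the Pearson correlation of the ranks. The sketch below assigns tied values their average rank; the helper names are hypothetical, and `scipy.stats.spearmanr` is the usual library equivalent.

```python
def midranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rho = Pearson correlation of the midranks."""
    ra, rb = midranks(a), midranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((u - ma) * (v - mb) for u, v in zip(ra, rb))
    va = sum((u - ma) ** 2 for u in ra)
    vb = sum((v - mb) ** 2 for v in rb)
    return cov / (va * vb) ** 0.5
```

Because only ranks matter, a new metric that is any monotone transform of SUV95 gives $|\rho_S| = 1$, for example `spearman([1, 2, 3, 4], [1, 4, 9, 16])` is 1.0.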

Then compare logistic models:

Baseline model:

$$ \operatorname{logit} P(Y_i=1) = \beta_0 + \beta_1 z_i^{(0)}. $$

Extended model:

$$ \operatorname{logit} P(Y_i=1) = \beta_0 + \beta_1 z_i^{(0)} + \beta_2 z_i^{(1)}. $$

The new metric is promising if the extended model improves out-of-sample discrimination (for example, cross-validated AUC) and $\beta_2$ is stable under resampling.

However, with a very small AE count, two-variable models can be unreliable, so this comparison should be treated as exploratory.
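One common way to check whether the extended model fits better than the baseline model (an assumption here; the text above does not prescribe a specific test) is the likelihood-ratio statistic for these nested models, where $\hat p_i$ denotes each model's fitted AE probability for patient $i$:

$$ \ell = \sum_{i=1}^{n} \left[ Y_i \log \hat p_i + (1 - Y_i) \log (1 - \hat p_i) \right], $$

$$ \mathrm{LR} = 2 \left( \ell_{\mathrm{ext}} - \ell_{\mathrm{base}} \right) \;\overset{\text{approx.}}{\sim}\; \chi^2_1 \quad \text{under} \quad H_0: \beta_2 = 0. $$

With a very small AE count, or when penalized fits are used, the $\chi^2_1$ reference distribution is only a rough approximation.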

7. Cross-validated AUC

To estimate out-of-sample discrimination, use cross-validation.

For each fold:

fit the model on training data;
predict AE probabilities on held-out data;
compute AUC on held-out predictions.

The final estimate is:

$$ \operatorname{AUC}_{CV} = \frac{1}{K} \sum_{k=1}^K \operatorname{AUC}_k. $$

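With a small AE count, use stratified folds. The number of folds must not exceed the number of AE cases, so that every held-out fold contains at least one AE case (otherwise its AUC is undefined):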

$$ K \le n_{\mathrm{AE}}. $$

For example, if

$$ n_{\mathrm{AE}} = 5, $$

then at most 5-fold stratified cross-validation is possible.
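The fold loop above can be sketched with stratified folds built by dealing shuffled AE and non-AE indices round-robin, so each held-out fold contains both classes. The `fit` argument is any function that returns a predictor. The default `direction_model`, which only learns the sign of the association from the training fold, is a hypothetical stand-in for a fitted logistic model. Fold AUCs are averaged as in the formula above.

```python
import random
import statistics

def auc(scores, y):
    """Pairwise AUC with ties counted as one half."""
    ae = [s for s, lab in zip(scores, y) if lab == 1]
    nc = [s for s, lab in zip(scores, y) if lab == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in ae for b in nc)
    return wins / (len(ae) * len(nc))

def direction_model(z_train, y_train):
    """Hypothetical stand-in model: learns only the sign of the association."""
    ae = [v for v, lab in zip(z_train, y_train) if lab == 1]
    nc = [v for v, lab in zip(z_train, y_train) if lab == 0]
    sign = 1.0 if statistics.median(ae) >= statistics.median(nc) else -1.0
    return lambda z_new: [sign * v for v in z_new]

def stratified_cv_auc(z, y, k, fit=direction_model, seed=0):
    """Mean of per-fold held-out AUCs over k stratified folds."""
    ae_idx = [i for i, lab in enumerate(y) if lab == 1]
    nc_idx = [i for i, lab in enumerate(y) if lab == 0]
    if not 2 <= k <= min(len(ae_idx), len(nc_idx)):
        raise ValueError("need 2 <= K <= number of cases in the smaller class")
    rng = random.Random(seed)
    rng.shuffle(ae_idx)
    rng.shuffle(nc_idx)
    # deal each class round-robin: every fold gets at least one AE and one non-AE case
    folds = [ae_idx[f::k] + nc_idx[f::k] for f in range(k)]
    fold_aucs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        predict = fit([z[i] for i in train], [y[i] for i in train])
        scores = predict([z[i] for i in test])
        fold_aucs.append(auc(scores, [y[i] for i in test]))
    return sum(fold_aucs) / k, fold_aucs
```

Unlike the oriented AUC computed on all data, the direction here is learned only from each training fold, so the held-out AUC is not inflated by choosing the direction after seeing the test cases.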

8. Multiple testing caution

If many metrics are tested,

$$ z^{(1)}, z^{(2)}, \dots, z^{(m)}, $$

then some may appear promising by chance.

For exploratory analysis, report results transparently and avoid strong claims.

For confirmatory analysis, use correction methods such as Bonferroni:

$$ \alpha_{\mathrm{Bonferroni}} = \frac{\alpha}{m}, $$

or false discovery rate (FDR) control, for example the Benjamini-Hochberg procedure.
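Both corrections are short enough to write out. This sketch returns adjusted $p$-values, which can be compared with $\alpha$ directly; the function names are hypothetical, and `statsmodels.stats.multitest.multipletests` with `method="bonferroni"` or `method="fdr_bh"` is a common library alternative.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values, min(1, m * p); equivalent to testing p < alpha / m."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, keeping adjusted values monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For four metrics with $p = (0.01, 0.04, 0.03, 0.20)$, Bonferroni gives $(0.04, 0.16, 0.12, 0.8)$, while Benjamini-Hochberg gives approximately $(0.04, 0.053, 0.053, 0.2)$, which is less conservative.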

In small datasets, it is better to test a small number of biologically motivated metrics than a large radiomics feature set.

9. Recommended reporting table

For each candidate metric, report:

$$ \operatorname{median}(z \mid AE), \qquad \operatorname{median}(z \mid NC), $$

$$ p_{\mathrm{MWU}}, $$

$$ \operatorname{AUC}, $$

$$ 95\% \ \operatorname{CI}_{AUC}. $$

A useful table structure is:

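| Metric | AE median | NC median | MWU p-value | AUC | AUC 95% CI | Direction |
|---|---|---|---|---|---|---|
| SUV95 | | | | | | |
| TailExcess95 | | | | | | |
| ComponentEntropy95 | | | | | | |
| LargestComponentFraction95 | | | | | | |
| TailLocalContrast95 | | | | | | |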

10. Practical interpretation

A useful biomarker candidate should satisfy several criteria:

It visually separates AE and non-AE cases.
It has a reasonably high AUC.
Its bootstrap AUC confidence interval is not extremely wide.
It is robust across nearby thresholds, for example $Q_{90}$, $Q_{95}$, and $Q_{97.5}$.
It is not merely a duplicate of $SUV95$.
It has a plausible biological interpretation.
It remains stable when influential patients are removed.
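One way to check the last criterion is a leave-one-out sensitivity analysis: recompute the AUC with each patient removed and look at the largest changes. This is a minimal sketch with hypothetical names, assuming `z` and `y` are Python lists; removals that would leave only one class are skipped.

```python
def auc(z, y):
    """Pairwise AUC with ties counted as one half."""
    ae = [v for v, lab in zip(z, y) if lab == 1]
    nc = [v for v, lab in zip(z, y) if lab == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in ae for b in nc)
    return wins / (len(ae) * len(nc))

def leave_one_out_auc(z, y):
    """Change in AUC when each patient is removed, largest absolute change first."""
    full = auc(z, y)
    results = []
    for i in range(len(z)):
        z_i, y_i = z[:i] + z[i + 1:], y[:i] + y[i + 1:]
        if len(set(y_i)) < 2:
            continue  # removing the only case of a class leaves AUC undefined
        results.append((i, auc(z_i, y_i) - full))
    # stable sort: ties in |change| keep the original patient order
    results.sort(key=lambda t: abs(t[1]), reverse=True)
    return full, results
```

With only two AE cases, as in `leave_one_out_auc([9, 1, 2, 3, 4], [1, 1, 0, 0, 0])`, the full AUC is 0.5, but removing a single AE patient moves it to 0.0 or 1.0. That is exactly the kind of instability this check is meant to expose.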

For spatial SUV metrics, especially promising candidates are:

$$ SUV95, $$

$$ \operatorname{TailExcess}_{95}, $$

$$ \operatorname{LargestComponentFraction}_{95}, $$

$$ \operatorname{ComponentEntropy}_{95}, $$

$$ \operatorname{TailSpread}_{95}, $$

$$ \operatorname{LocalContrast}_{95}. $$

The best metric is not necessarily the one with the smallest $p$-value. It should be interpretable, robust, and clinically plausible.
