# Testing a Candidate SUV Spatial Metric as a Biomarker

Ref:

* ChatGPT 5.4

Suppose each patient/image has a binary outcome

$$
Y_i =
\begin{cases}
1, & \text{adverse event (AE)},\\
0, & \text{non-AE / control},
\end{cases}
$$

and a candidate imaging metric

$$
z_i \in \mathbb{R}.
$$

Examples of candidate metrics are:

$$
z_i = \mathrm{SUV95}_i,
$$

$$
z_i = \operatorname{TailExcess}_{95,i},
$$

$$
z_i = \operatorname{ComponentEntropy}_{95,i},
$$

$$
z_i = \operatorname{LocalContrast}_{95,i}.
$$

The goal is to test whether $z_i$ contains useful information for distinguishing AE from non-AE cases.

---

## 1. Descriptive comparison between AE and non-AE groups

First compare the empirical distributions:

$$
\{z_i : Y_i = 1\}
\qquad \text{and} \qquad
\{z_i : Y_i = 0\}.
$$

Useful summaries are:

$$
\operatorname{median}(z \mid Y=1),
\qquad
\operatorname{median}(z \mid Y=0),
$$

$$
\operatorname{IQR}(z \mid Y=1),
\qquad
\operatorname{IQR}(z \mid Y=0).
$$

A simple effect-size summary is the difference in medians:

$$
\Delta_{\mathrm{med}}
=
\operatorname{median}(z \mid Y=1)
-
\operatorname{median}(z \mid Y=0).
$$

Because the AE group is often small, visualization is essential. Use a strip plot, box plot, or violin plot, with individual patients shown explicitly.

---

## 2. Nonparametric group-difference test

A first screening test is the Mann-Whitney U test.

The null hypothesis is:

$$
H_0:
z \mid Y=1
\quad \text{and} \quad
z \mid Y=0
\quad \text{come from the same distribution}.
$$

The alternative hypothesis is:

$$
H_1:
z \mid Y=1
\quad \text{and} \quad
z \mid Y=0
\quad \text{differ in distribution}.
$$

This test is useful for screening, but with small datasets the $p$-value should not be overinterpreted.

Important:

- A small $p$-value does not prove clinical usefulness.
- A large $p$-value does not prove the metric is useless.
- With a very small AE count, the test has low statistical power.

---

## 3. ROC AUC as a discrimination measure

For biomarker evaluation, ROC AUC is often more relevant than a group-difference test alone.

Given a threshold $c$, classify a case as AE if

$$
z_i \ge c.
$$

Then define:

$$
\operatorname{TPR}(c) = P(z \ge c \mid Y=1),
$$

$$
\operatorname{FPR}(c) = P(z \ge c \mid Y=0).
$$

The ROC curve is

$$
\operatorname{TPR}(c)
\quad \text{versus} \quad
\operatorname{FPR}(c),
$$

as the threshold $c$ varies.

The AUC is:

$$
\operatorname{AUC}
=
P(z_{\mathrm{AE}} > z_{\mathrm{NC}}),
$$

where $z_{\mathrm{AE}}$ is a randomly selected AE value and $z_{\mathrm{NC}}$ is a randomly selected non-AE value. When tied values occur, each tie is conventionally counted as $1/2$.

Interpretation:

$$
\operatorname{AUC} = 0.5
\quad \Rightarrow \quad
\text{no discrimination},
$$

$$
\operatorname{AUC} > 0.5
\quad \Rightarrow \quad
\text{larger metric values tend to indicate AE},
$$

$$
\operatorname{AUC} < 0.5
\quad \Rightarrow \quad
\text{smaller metric values tend to indicate AE}.
$$

For screening, it is useful to define an oriented AUC:

$$
\operatorname{AUC}_{\mathrm{oriented}}
=
\max(\operatorname{AUC}, 1-\operatorname{AUC}).
$$

This measures discrimination strength independent of direction.

---

## 4. Bootstrap confidence interval for AUC

Because the AE group is usually small, the AUC estimate can be unstable.

Use bootstrap resampling:

1. sample patients with replacement;
2. compute AUC in each bootstrap sample;
3. repeat many times;
4. take empirical quantiles of the bootstrap AUC values.

Let

$$
\operatorname{AUC}^{*(b)}
$$

be the AUC from bootstrap sample $b$, where

$$
b = 1,\dots,B.
$$

A simple percentile confidence interval is:

$$
\left[
Q_{0.025}\left(\operatorname{AUC}^{*}\right),
Q_{0.975}\left(\operatorname{AUC}^{*}\right)
\right].
$$

If a bootstrap sample contains only one class, it should be skipped because AUC is not defined.

A candidate metric is more promising if its AUC is high and the confidence interval is not extremely wide.

---

## 5. Logistic regression model for one metric

A simple probabilistic model is:

$$
P(Y_i = 1 \mid z_i)
=
\operatorname{sigmoid}(\beta_0 + \beta_1 z_i),
$$

where

$$
\operatorname{sigmoid}(u)
=
\frac{1}{1+\exp(-u)}.
$$

Equivalently,

$$
\operatorname{logit} P(Y_i = 1 \mid z_i)
=
\beta_0 + \beta_1 z_i.
$$

Before fitting, it is usually useful to standardize the metric:

$$
\tilde{z}_i
=
\frac{z_i - \bar{z}}{s_z}.
$$

Then the model becomes:

$$
P(Y_i = 1 \mid \tilde{z}_i)
=
\operatorname{sigmoid}(\beta_0 + \beta_1 \tilde{z}_i).
$$

The sign of $\beta_1$ gives the direction of association:

$$
\beta_1 > 0
\quad \Rightarrow \quad
\text{larger metric values increase AE risk},
$$

$$
\beta_1 < 0
\quad \Rightarrow \quad
\text{larger metric values decrease AE risk}.
$$

With a small AE count, standard logistic regression may be unstable. Penalized logistic regression is often safer.

---

## 6. Compare a new metric against baseline SUV95

A new metric is interesting only if it adds information beyond a baseline such as $\mathrm{SUV95}$.

Let

$$
z_i^{(0)} = \mathrm{SUV95}_i
$$

be the baseline metric, and let

$$
z_i^{(1)}
$$

be a new candidate metric, for example local contrast or component entropy.

First check correlation:

$$
\rho_S
=
\operatorname{corr}_{\mathrm{Spearman}}
\left(
z^{(0)}, z^{(1)}
\right).
$$

If

$$
|\rho_S| \approx 1,
$$

then the new metric may mostly duplicate SUV95.

Then compare logistic models.

Baseline model:

$$
\operatorname{logit} P(Y_i=1)
=
\beta_0 + \beta_1 z_i^{(0)}.
$$

Extended model:

$$
\operatorname{logit} P(Y_i=1)
=
\beta_0 + \beta_1 z_i^{(0)} + \beta_2 z_i^{(1)}.
$$

The new metric is promising if the extended model improves prediction and $\beta_2$ is stable under resampling.

However, with a very small AE count, two-variable models can be unreliable. Therefore, this comparison should be treated as exploratory.

---

## 7. Cross-validated AUC

To estimate out-of-sample discrimination, use cross-validation.

For each fold:

1. fit the model on training data;
2. predict AE probabilities on held-out data;
3. compute AUC on held-out predictions.

The final estimate is:

$$
\operatorname{AUC}_{\mathrm{CV}}
=
\frac{1}{K}
\sum_{k=1}^K
\operatorname{AUC}_k.
$$

For a small AE count, the number of folds must not exceed the number of AE cases, so that every held-out fold contains at least one AE case:

$$
K \le n_{\mathrm{AE}}.
$$

For example, if

$$
n_{\mathrm{AE}} = 5,
$$

then at most 5-fold stratified cross-validation is possible.

---

## 8. Multiple testing caution

If many metrics are tested,

$$
z^{(1)}, z^{(2)}, \dots, z^{(m)},
$$

then some may appear promising by chance.

For exploratory analysis, report results transparently and avoid strong claims.

For confirmatory analysis, use correction methods such as Bonferroni:

$$
\alpha_{\mathrm{Bonferroni}}
=
\frac{\alpha}{m},
$$

or false discovery rate control.

In small datasets, it is better to test a small number of biologically motivated metrics than a large radiomics feature set.

---

## 9. Recommended reporting table

For each candidate metric, report:

$$
\operatorname{median}(z \mid \mathrm{AE}),
\qquad
\operatorname{median}(z \mid \mathrm{NC}),
$$

$$
p_{\mathrm{MWU}},
$$

$$
\operatorname{AUC},
$$

$$
95\% \ \operatorname{CI}_{\mathrm{AUC}}.
$$

A useful table structure is:

| Metric | AE median | NC median | MWU p-value | AUC | AUC 95% CI | Direction |
|---|---:|---:|---:|---:|---:|---|
| SUV95 | | | | | | |
| TailExcess95 | | | | | | |
| ComponentEntropy95 | | | | | | |
| LargestComponentFraction95 | | | | | | |
| TailLocalContrast95 | | | | | | |

---

## 10. Practical interpretation

A useful biomarker candidate should satisfy several criteria:

1. It visually separates AE and non-AE cases.
2. It has reasonable AUC.
3. Its bootstrap AUC confidence interval is not extremely wide.
4. It is robust across nearby thresholds, for example $Q_{90}$, $Q_{95}$, and $Q_{97.5}$.
5. It is not merely a duplicate of $\mathrm{SUV95}$.
6. It has a plausible biological interpretation.
7. It remains stable when influential patients are removed.

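
Several of the criteria above (AUC, bootstrap interval width, overlap with SUV95, and stability when single patients are removed) can be checked numerically. The following is a minimal sketch using only the Python standard library. All function names and the toy values are hypothetical illustrations, not part of any established pipeline; ties are counted as 1/2 in the AUC, and the bootstrap quantiles use simple order statistics without interpolation.

```python
import random

def auc(z, y):
    """Empirical AUC = P(z_AE > z_NC), with ties counted as 1/2."""
    pos = [v for v, lab in zip(z, y) if lab == 1]
    neg = [v for v, lab in zip(z, y) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(z, y, n_boot=2000, seed=0):
    """Percentile 95% interval; one-class resamples are skipped."""
    rng = random.Random(seed)
    n = len(z)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # patients with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue
        stats.append(auc([z[i] for i in idx], yb))
    stats.sort()
    lo = stats[int(0.025 * (len(stats) - 1))]
    hi = stats[int(0.975 * (len(stats) - 1))]
    return lo, hi

def ranks(v):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks.
    Assumes neither input is constant."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (w - mb) for x, w in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((w - mb) ** 2 for w in rb)
    return cov / (va * vb) ** 0.5

def leave_one_out_auc_range(z, y):
    """Min and max AUC after removing each patient in turn."""
    vals = []
    for i in range(len(z)):
        zi, yi = z[:i] + z[i + 1:], y[:i] + y[i + 1:]
        if len(set(yi)) == 2:
            vals.append(auc(zi, yi))
    return min(vals), max(vals)

# Hypothetical toy data: 3 AE cases and 5 controls.
y = [1, 1, 1, 0, 0, 0, 0, 0]
suv95 = [5.0, 4.0, 2.5, 3.0, 2.0, 1.5, 1.0, 0.5]
candidate = [0.9, 0.5, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1]

summary = {
    "auc": auc(candidate, y),
    "auc_ci": bootstrap_auc_ci(candidate, y),
    "spearman_vs_suv95": spearman(suv95, candidate),
    "loo_auc_range": leave_one_out_auc_range(candidate, y),
}
print(summary)
```

A wide `auc_ci`, a `spearman_vs_suv95` close to 1 in absolute value, or a large gap in `loo_auc_range` would each argue against the candidate. What counts as "wide" or "large" is left to the analyst, since the criteria above do not fix numerical cutoffs.
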

For spatial SUV metrics, especially promising candidates are:

$$
\mathrm{SUV95},
$$

$$
\operatorname{TailExcess}_{95},
$$

$$
\operatorname{LargestComponentFraction}_{95},
$$

$$
\operatorname{ComponentEntropy}_{95},
$$

$$
\operatorname{TailSpread}_{95},
$$

$$
\operatorname{LocalContrast}_{95}.
$$

The best metric is not necessarily the one with the smallest $p$-value. It should be interpretable, robust, and clinically plausible.
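
One row of the reporting table from Section 9 could be assembled along the following lines. This is a sketch under stated assumptions: the function names, the dictionary keys, and the toy values are hypothetical, and the Mann-Whitney $p$-value is computed in its exact permutation form by enumerating every way of assigning the AE labels, which is only feasible for small samples. The Bonferroni flag is included to show that a small raw $p$-value need not survive correction.

```python
from itertools import combinations
from statistics import median

def u_stat(pos, neg):
    """Mann-Whitney U for the AE group: pairs with AE > NC, ties count 1/2."""
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)

def exact_mwu_p(z, y):
    """Two-sided exact permutation p-value for U (small samples only)."""
    n = len(z)
    ae_obs = [i for i in range(n) if y[i] == 1]
    center = len(ae_obs) * (n - len(ae_obs)) / 2  # mean of U under H0

    def deviation(ae_idx):
        chosen = set(ae_idx)
        pos = [z[i] for i in chosen]
        neg = [z[i] for i in range(n) if i not in chosen]
        return abs(u_stat(pos, neg) - center)

    obs = deviation(ae_obs)
    devs = [deviation(c) for c in combinations(range(n), len(ae_obs))]
    # Fraction of label assignments at least as extreme as the observed one.
    return sum(d >= obs for d in devs) / len(devs)

def report_row(name, z, y, n_metrics, alpha=0.05):
    """One row of the reporting table for a single metric."""
    pos = [v for v, lab in zip(z, y) if lab == 1]
    neg = [v for v, lab in zip(z, y) if lab == 0]
    a = u_stat(pos, neg) / (len(pos) * len(neg))  # AUC = U / (n_AE * n_NC)
    p = exact_mwu_p(z, y)
    if a > 0.5:
        direction = "higher in AE"
    elif a < 0.5:
        direction = "lower in AE"
    else:
        direction = "none"
    return {
        "metric": name,
        "ae_median": median(pos),
        "nc_median": median(neg),
        "mwu_p": p,
        "auc": a,
        "auc_oriented": max(a, 1 - a),
        "direction": direction,
        "passes_bonferroni": p < alpha / n_metrics,
    }

# Hypothetical toy data: 3 AE cases and 5 controls, two metrics.
y = [1, 1, 1, 0, 0, 0, 0, 0]
metrics = {
    "SUV95": [5.0, 4.0, 2.5, 3.0, 2.0, 1.5, 1.0, 0.5],
    "LocalContrast95": [0.2, 0.9, 0.4, 0.5, 0.3, 0.8, 0.1, 0.6],
}
rows = [report_row(name, z, y, n_metrics=len(metrics))
        for name, z in metrics.items()]
for row in rows:
    print(row)
```

For larger samples the enumeration becomes impractical, and a library routine such as `scipy.stats.mannwhitneyu` is the usual choice. The bootstrap confidence interval column is omitted here to keep the sketch short.
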