Kaynağa Gözat

Currently working!

Nicholas Schense 1 ay önce
ebeveyn
işleme
810cef2630

+ 28 - 8
analysis/ANALYSES_OVERVIEW.md

@@ -6,6 +6,7 @@ This document summarizes what each analysis in this folder does and what it prod
 
 - Analysis defaults are centralized in `analysis/defaults.py`.
 - The orchestrator CLI in `analysis/evaluate_models.py` is intentionally minimal and focuses on operational controls (`--backend`, `--run-name`, `--skip-noise`).
+- Standalone operational modes include `--longitudinal-breakdown-only`, `--noise-correlation-only`, and `--dataset-summary-only`.
 - Threshold grids, uncertainty percentile grids, noise-factor grids, calibration bins, class index, decision threshold, and bayesian MC passes are sourced from `analysis/defaults.py`.
 
 ## Shared Utilities
@@ -20,7 +21,8 @@ This document summarizes what each analysis in this folder does and what it prod
 - Method: Evaluate metrics across an evenly spaced threshold grid.
 - Main outputs:
   - `performance_threshold_sweep.csv`
-  - `plots/performance_threshold_sweep.png`
+  - `plots/performance_threshold_accuracy.png`
+  - `plots/performance_threshold_f1.png`
 
 ## 2. Uncertainty Cutoff (Raw-Value)
 
@@ -33,7 +35,8 @@ This document summarizes what each analysis in this folder does and what it prod
   - Compute accuracy and F1 for each retained subset.
 - Main outputs:
   - `performance_uncertainty_cutoff.csv`
-  - `plots/performance_uncertainty_cutoff.png`
+  - `plots/performance_uncertainty_cutoff_accuracy.png`
+  - `plots/performance_uncertainty_cutoff_f1.png`
 
 ## 3. Uncertainty Cutoff (Percentile-Ranked)
 
@@ -45,7 +48,8 @@ This document summarizes what each analysis in this folder does and what it prod
   - Plot from least restricted on the left (all samples) to most restricted on the right (lowest-uncertainty subset only).
 - Main outputs:
   - `performance_uncertainty_percentile_cutoff.csv`
-  - `plots/performance_uncertainty_percentile_cutoff.png`
+  - `plots/performance_uncertainty_percentile_cutoff_accuracy.png`
+  - `plots/performance_uncertainty_percentile_cutoff_f1.png`
 
 ## 4. Calibration Analysis
 
@@ -53,7 +57,7 @@ This document summarizes what each analysis in this folder does and what it prod
 - Inputs: Ground-truth labels and predicted probabilities.
 - Method:
   - Reliability binning with configurable bin count.
-  - Compute ECE, MCE, and Brier score.
+  - Compute MCE and Brier score.
 - Main outputs:
   - `calibration_bins.csv`
   - `plots/calibration_reliability.png`
@@ -66,7 +70,7 @@ This document summarizes what each analysis in this folder does and what it prod
   - Merge by image ID.
   - Group metrics by physician confidence level.
   - Plot distributions per rating group.
-- Main outputs include grouped summary CSV files and boxplots for confidence and secondary uncertainty metrics.
+- Main outputs include grouped summary CSV files and boxplots for confidence and standard deviation (ensemble) or predictive uncertainty (bayesian).
 
 ## 6. Longitudinal Stability Analysis
 
@@ -89,10 +93,26 @@ This document summarizes what each analysis in this folder does and what it prod
   - Save visual examples of noised images.
 - Main outputs:
   - `noise_sensitivity.csv`
-  - `plots/noise_sensitivity.png`
-  - `plots/noise_uncertainty.png`
-  - `plots/noise_confidence_certainty.png`
+  - `plots/noise_sensitivity_accuracy.png`
+  - `plots/noise_sensitivity_f1.png`
+  - `plots/noise_confidence.png`
+  - `plots/noise_standard_deviation.png` (ensemble) / `plots/noise_predictive_uncertainty.png` (bayesian)
   - `plots/noise_examples/*_noise_examples.png`
+  - `plots/noise_examples/*_clean_scan_example.png`
+
+## 8. Dataset Composition Summary
+
+- Purpose: Report dataset composition without rerunning full evaluation analyses.
+- Inputs: Raw dataset files and configured train/validation/test split ratios.
+- Method:
+  - Rebuild dataset and split assignment using the configured split seed.
+  - Count total images and positive/negative labels overall.
+  - Count train/validation/test images and per-split class balance.
+  - Compute both overall and split-level percentages.
+- Main outputs:
+  - `dataset_summary.md`
+- CLI:
+  - `python -m analysis.evaluate_models --dataset-summary-only`
 
 ## Plot Report
 

+ 40 - 11
analysis/README.md

@@ -5,7 +5,7 @@
 This folder should contain all the necessary code for the evaluation of the Bayeisan model and the Deep Ensemble. It should generate and save graphs and statistics for this. The included analyses should include
 
 - Performance information (i.e. basic information on accuracy, number correct, number incorrect, F1 score, etc.)
-- Some basic metrics for uncertainty (ECE, MCE)
+- Some basic metrics for uncertainty (MCE, Brier)
 - Physican Confidence Analysis (graphing uncertainty vs physican confidence)
 - Longitudinal Analysis (graphing uncertainty on patients who remained stable CN, stable AD, or switched from CN to AD)
 - Noise introduction analysis (graphing uncertainty on normal image and images with increasing levels of Gaussian noise applied)
@@ -25,9 +25,9 @@ If a backend does not already have `model_evaluation_results.nc`, the pipeline n
 
 Uncertainty analyses are now run using both:
 
-- Confidence-based certainty: `2 * |p - 0.5|` where `0` means very uncertain and `1` means very certain
-- Confidence-based uncertainty: `1 - 2 * |p - 0.5|` so larger values mean more uncertainty, matching std-based plots
-- Secondary uncertainty metric: ensemble uses std across models; bayesian uses predictive entropy across MC samples (`bayesian_torch.utils.util.predictive_entropy`)
+- Confidence
+- Standard deviation (ensemble)
+- Predictive uncertainty (bayesian; computed via `bayesian_torch.utils.util.predictive_entropy`)
 
 ### Current Modules
 
@@ -65,6 +65,26 @@ If you want to skip noise analysis while validating the pipeline:
 python -m analysis.evaluate_models --skip-noise
 ```
 
+If you want to quickly investigate longitudinal cohort breakdown only (without rerunning all analyses):
+
+```bash
+python -m analysis.evaluate_models --longitudinal-breakdown-only
+```
+
+If you want to rerun only the noise uncertainty-vs-accuracy regression/correlation analysis from existing noise CSV outputs:
+
+```bash
+python -m analysis.evaluate_models --noise-correlation-only --run-name run_YYYYMMDD_HHMMSS
+```
+
+If you want to generate only dataset composition documentation (total images, class counts, train/validation/test counts, and percentage breakdowns) without rerunning analyses:
+
+```bash
+python -m analysis.evaluate_models --dataset-summary-only
+```
+
+This writes `dataset_summary.md` into the selected run directory under `analysis_output`.
+
 Most analysis tuning parameters are now intentionally centralized in `analysis/defaults.py` to reduce CLI verbosity and improve maintainability.
 
 ### Output Layout
@@ -93,21 +113,30 @@ alnn_rewrite/analysis_output/
 			longitudinal_confidence_cohort_summary.csv
 			longitudinal_std_cohort_summary.csv
 			noise_sensitivity.csv
+			noise_accuracy_uncertainty_stats.csv
+			noise_accuracy_uncertainty_summary.md
 			plots/
-				performance_threshold_sweep.png
+				performance_threshold_accuracy.png
+				performance_threshold_f1.png
 				calibration_reliability.png
-				performance_uncertainty_cutoff.png
-				performance_uncertainty_percentile_cutoff.png
+				performance_uncertainty_cutoff_accuracy.png
+				performance_uncertainty_cutoff_f1.png
+				performance_uncertainty_percentile_cutoff_accuracy.png
+				performance_uncertainty_percentile_cutoff_f1.png
 				physician_confidence_boxplot.png
 				physician_std_boxplot.png
 				longitudinal_cohort_confidence.png
 				longitudinal_cohort_std.png
-				noise_sensitivity.png
-				noise_uncertainty.png
-				noise_confidence_certainty.png
+				noise_sensitivity_accuracy.png
+				noise_sensitivity_f1.png
+				noise_confidence.png
+				noise_standard_deviation.png (ensemble) / noise_predictive_uncertainty.png (bayesian)
+				noise_accuracy_uncertainty_2d.png
 				noise_examples/
 					ensemble_noise_examples.png
 					bayesian_noise_examples.png
+					ensemble_clean_scan_example.png
+					bayesian_clean_scan_example.png
 		bayesian/
 			... (same structure)
 ```
@@ -119,6 +148,6 @@ Performance analysis now includes two uncertainty cutoff views:
 - A raw-value cutoff sweep that keeps samples with uncertainty below the percentile-derived cutoff value.
 - A percentile-ranked cutoff sweep that is plotted from least restricted (left, all samples) to most restricted (right, lowest-uncertainty subset).
 
-For the noise analysis, the uncertainty plot uses the confidence metric in its uncertainty orientation, so higher values always mean more uncertainty. A separate certainty plot is also saved for direct inspection of `2 * |p - 0.5|`.
+Noise analysis now saves individual series plots (accuracy, F1, confidence, and standard deviation or predictive uncertainty) so each metric is shown on its own graph.
 
 Noise plots now label the x-axis as Gaussian Noise Factor to reflect that values are multipliers on the actual noise sigma.

+ 145 - 53
analysis/analysis_modules.py

@@ -3,6 +3,7 @@
 from __future__ import annotations
 
 from pathlib import Path
+import re
 from typing import Any
 
 import numpy as np
@@ -16,10 +17,11 @@ from .plotting import (
     save_boxplot,
     save_calibration_plot,
     save_performance_threshold_plot,
+    save_performance_threshold_pair_plot,
     save_uncertainty_cutoff_plot,
+    save_uncertainty_cutoff_pair_plot,
 )
 from .runtime import write_json
-from .uncertainty import confidence_uncertainty
 
 
 def _save_table(rows: list[dict[str, Any]], out_path: Path) -> pd.DataFrame:
@@ -39,7 +41,7 @@ def _uncertainty_percentiles(values: np.ndarray) -> np.ndarray:
 
     order = np.argsort(vals)
     ranks = np.empty(n, dtype=float)
-    ranks[order] = np.arange(n, dtype=float)
+    ranks[order] = np.arange(0, n, dtype=float)
     return (ranks / float(n - 1)) * 100.0
 
 
@@ -47,7 +49,7 @@ def _save_ensemble_prediction_debug(
     evaluation: BackendEvaluation,
     output_dir: Path,
 ) -> str:
-    uncertainty_vals = confidence_uncertainty(evaluation.y_prob)
+    uncertainty_vals = np.asarray(evaluation.uncertainty_confidence, dtype=float)
     uncertainty_pct = _uncertainty_percentiles(uncertainty_vals)
     pred_label = (evaluation.y_prob >= DEFAULT_DECISION_THRESHOLD).astype(int)
     true_label = np.asarray(evaluation.y_true, dtype=int)
@@ -60,10 +62,10 @@ def _save_ensemble_prediction_debug(
             "predicted_label": pred_label,
             "actual_label": true_label,
             "is_correct": is_correct,
-            "uncertainty": uncertainty_vals,
-            "uncertainty_percentile": uncertainty_pct,
+            "confidence": uncertainty_vals,
+            "confidence_percentile": uncertainty_pct,
         }
-    ).sort_values("uncertainty", ascending=True)
+    ).sort_values("confidence", ascending=False)
 
     path = output_dir / "ensemble_prediction_debug.csv"
     debug_df.to_csv(path, index=False)
@@ -75,13 +77,18 @@ def _uncertainty_cutoff_analysis(
     output_dir: Path,
     uncertainty_types: list[tuple[str, np.ndarray]],
     cutoff_percentiles: np.ndarray,
-    keep_higher: bool,
     table_filename: str,
-    plot_filename: str,
+    plot_filename_prefix: str,
     title_prefix: str,
     x_label: str,
-    selection_label: str,
 ) -> dict[str, Any]:
+    def _slug(name: str) -> str:
+        s = re.sub(r"[^a-z0-9]+", "_", name.strip().lower())
+        return s.strip("_") or "metric"
+
+    def _confidence_like(name: str) -> bool:
+        return "confidence" in name.strip().lower()
+
     cutoff_rows: list[dict[str, Any]] = []
 
     for uncertainty_name, values in uncertainty_types:
@@ -92,6 +99,12 @@ def _uncertainty_cutoff_analysis(
         values_valid = values[finite_mask]
         y_true_valid = evaluation.y_true[finite_mask]
         y_prob_valid = evaluation.y_prob[finite_mask]
+        keep_higher = _confidence_like(uncertainty_name)
+        selection_label = (
+            "metric >= percentile cutoff (higher is lower uncertainty)"
+            if keep_higher
+            else "metric <= percentile cutoff (lower is lower uncertainty)"
+        )
 
         for cutoff_percentile in cutoff_percentiles:
             cutoff_value = float(np.percentile(values_valid, cutoff_percentile))
@@ -117,35 +130,86 @@ def _uncertainty_cutoff_analysis(
                     "n_retained": retained,
                     "coverage": float(retained / len(values_valid)),
                     "selection_rule": selection_label,
+                    "keep_higher": bool(keep_higher),
                     "accuracy": float(perf["accuracy"]),
                     "f1": float(perf["f1"]),
+                    "n_correct": int(perf["tp"] + perf["tn"]),
+                    "n_incorrect": int(perf["fp"] + perf["fn"]),
                 }
             )
 
     table_path = output_dir / table_filename
-    plot_path = plots_dir(output_dir) / plot_filename
+    accuracy_plot_path = plots_dir(output_dir) / f"{plot_filename_prefix}_accuracy.png"
+    f1_plot_path = plots_dir(output_dir) / f"{plot_filename_prefix}_f1.png"
+    pair_plot_path = plots_dir(output_dir) / f"{plot_filename_prefix}_accuracy_f1.png"
     if not cutoff_rows:
-        return {"table": str(table_path), "plot": str(plot_path), "rows": 0}
+        return {
+            "table": str(table_path),
+            "accuracy_plot": str(accuracy_plot_path),
+            "f1_plot": str(f1_plot_path),
+            "pair_plot": str(pair_plot_path),
+            "rows": 0,
+        }
 
     cutoff_df = pd.DataFrame(cutoff_rows)
     cutoff_df.to_csv(table_path, index=False)
 
     # Restriction level increases from left to right so right-most points are most restricted.
-    if keep_higher:
-        cutoff_df["restriction_level"] = cutoff_df["cutoff_percentile"]
-    else:
-        cutoff_df["restriction_level"] = 100.0 - cutoff_df["cutoff_percentile"]
-
-    save_uncertainty_cutoff_plot(
-        cutoff_df=cutoff_df,
-        title_prefix=title_prefix,
-        x_label=x_label,
-        output_path=plot_path,
+    cutoff_df["restriction_level"] = np.where(
+        cutoff_df["keep_higher"].astype(bool),
+        cutoff_df["cutoff_percentile"],
+        100.0 - cutoff_df["cutoff_percentile"],
     )
 
+    plots_by_uncertainty: dict[str, dict[str, str]] = {}
+    for uncertainty_name in sorted(pd.unique(cutoff_df["uncertainty_type"])):
+        sub_df = cutoff_df[cutoff_df["uncertainty_type"] == uncertainty_name].copy()
+        slug = _slug(str(uncertainty_name))
+        sub_accuracy_plot_path = (
+            plots_dir(output_dir) / f"{plot_filename_prefix}_{slug}_accuracy.png"
+        )
+        sub_f1_plot_path = (
+            plots_dir(output_dir) / f"{plot_filename_prefix}_{slug}_f1.png"
+        )
+        sub_pair_plot_path = (
+            plots_dir(output_dir) / f"{plot_filename_prefix}_{slug}_accuracy_f1.png"
+        )
+
+        save_uncertainty_cutoff_plot(
+            cutoff_df=sub_df,
+            title_prefix=f"{title_prefix} ({uncertainty_name})",
+            x_label=x_label,
+            output_path=sub_accuracy_plot_path,
+            metric_column="accuracy",
+            metric_label="Accuracy",
+            plot_key=f"{plot_filename_prefix}_{slug}_accuracy",
+        )
+        save_uncertainty_cutoff_plot(
+            cutoff_df=sub_df,
+            title_prefix=f"{title_prefix} ({uncertainty_name})",
+            x_label=x_label,
+            output_path=sub_f1_plot_path,
+            metric_column="f1",
+            metric_label="F1",
+            plot_key=f"{plot_filename_prefix}_{slug}_f1",
+        )
+        save_uncertainty_cutoff_pair_plot(
+            cutoff_df=sub_df,
+            title_prefix=f"{title_prefix} ({uncertainty_name})",
+            x_label=x_label,
+            output_path=sub_pair_plot_path,
+            plot_key=f"{plot_filename_prefix}_{slug}_accuracy_f1",
+        )
+
+        plots_by_uncertainty[str(uncertainty_name)] = {
+            "accuracy": str(sub_accuracy_plot_path),
+            "f1": str(sub_f1_plot_path),
+            "accuracy_f1": str(sub_pair_plot_path),
+        }
+
     return {
         "table": str(table_path),
-        "plot": str(plot_path),
+        "plots_by_uncertainty": plots_by_uncertainty,
         "rows": int(len(cutoff_df)),
     }
 
@@ -159,22 +223,46 @@ def run_performance(
     table_path = output_dir / "performance_threshold_sweep.csv"
     df = _save_table(rows, table_path)
 
-    plot_path = plots_dir(output_dir) / "performance_threshold_sweep.png"
+    accuracy_plot_path = plots_dir(output_dir) / "performance_threshold_accuracy.png"
+    f1_plot_path = plots_dir(output_dir) / "performance_threshold_f1.png"
+    pair_plot_path = plots_dir(output_dir) / "performance_threshold_accuracy_f1.png"
     save_performance_threshold_plot(
         df=df,
         backend=evaluation.backend,
-        output_path=plot_path,
+        output_path=accuracy_plot_path,
+        metric_column="accuracy",
+        metric_label="Accuracy",
+        plot_key="performance_threshold_accuracy",
+    )
+    save_performance_threshold_plot(
+        df=df,
+        backend=evaluation.backend,
+        output_path=f1_plot_path,
+        metric_column="f1",
+        metric_label="F1",
+        plot_key="performance_threshold_f1",
+    )
+    save_performance_threshold_pair_plot(
+        df=df,
+        backend=evaluation.backend,
+        output_path=pair_plot_path,
+        plot_key="performance_threshold_accuracy_f1",
     )
 
     best_idx = int(df["f1"].idxmax())
     best = df.iloc[best_idx].to_dict()
 
     cutoff_percentiles = uncertainty_cutoff_percentiles()
-    confidence_uncertainty_vals = confidence_uncertainty(evaluation.y_prob)
+    model_output_vals = np.asarray(evaluation.uncertainty_confidence, dtype=float)
     secondary_uncertainty = np.asarray(evaluation.uncertainty_std, dtype=float)
+    secondary_uncertainty_name = (
+        "predictive uncertainty"
+        if evaluation.uncertainty_metric == "predictive_entropy"
+        else "standard deviation"
+    )
     uncertainty_types = [
-        ("confidence_uncertainty", confidence_uncertainty_vals),
-        (evaluation.uncertainty_metric, secondary_uncertainty),
+        ("confidence", model_output_vals),
+        (secondary_uncertainty_name, secondary_uncertainty),
     ]
 
     uncertainty_cutoff = _uncertainty_cutoff_analysis(
@@ -182,12 +270,10 @@ def run_performance(
         output_dir=output_dir,
         uncertainty_types=uncertainty_types,
         cutoff_percentiles=cutoff_percentiles,
-        keep_higher=False,
         table_filename="performance_uncertainty_cutoff.csv",
-        plot_filename="performance_uncertainty_cutoff.png",
-        title_prefix="Uncertainty Cutoff Percentile",
-        x_label="Restriction Level (0 = all results, 100 = lowest-uncertainty subset)",
-        selection_label="uncertainty <= percentile cutoff",
+        plot_filename_prefix="performance_uncertainty_cutoff",
+        title_prefix="Model Output / Uncertainty Cutoff Percentile",
+        x_label="Restriction Level (0 = all samples, 100 = most restricted subset)",
     )
 
     percentile_cutoffs = uncertainty_cutoff_percentiles()
@@ -196,32 +282,34 @@ def run_performance(
         output_dir=output_dir,
         uncertainty_types=uncertainty_types,
         cutoff_percentiles=percentile_cutoffs,
-        keep_higher=True,
         table_filename="performance_uncertainty_percentile_cutoff.csv",
-        plot_filename="performance_uncertainty_percentile_cutoff.png",
-        title_prefix="Uncertainty Percentile Floor",
-        x_label="Uncertainty Percentile Floor (0 = all results, 100 = highest-uncertainty subset)",
-        selection_label="uncertainty >= percentile floor",
+        plot_filename_prefix="performance_uncertainty_percentile_cutoff",
+        title_prefix="Model Output / Uncertainty Percentile Floor",
+        x_label="Percentile Floor (0 = all samples, 100 = top percentile subset)",
     )
 
     cutoff_table_path = Path(uncertainty_cutoff["table"])
-    cutoff_plot_path = Path(uncertainty_cutoff["plot"])
     percentile_cutoff_table_path = Path(uncertainty_percentile_cutoff["table"])
-    percentile_cutoff_plot_path = Path(uncertainty_percentile_cutoff["plot"])
     summary = {
         "best_by_f1": {
             k: float(v) for k, v in best.items() if isinstance(v, (int, float))
         },
         "table": str(table_path),
-        "plot": str(plot_path),
+        "plots": {
+            "accuracy": str(accuracy_plot_path),
+            "f1": str(f1_plot_path),
+            "accuracy_f1": str(pair_plot_path),
+        },
         "uncertainty_cutoff": {
             "table": str(cutoff_table_path),
-            "plot": str(cutoff_plot_path),
+            "plots_by_uncertainty": uncertainty_cutoff["plots_by_uncertainty"],
             "decision_threshold": DEFAULT_DECISION_THRESHOLD,
         },
         "uncertainty_percentile_cutoff": {
             "table": str(percentile_cutoff_table_path),
-            "plot": str(percentile_cutoff_plot_path),
+            "plots_by_uncertainty": uncertainty_percentile_cutoff[
+                "plots_by_uncertainty"
+            ],
             "decision_threshold": DEFAULT_DECISION_THRESHOLD,
         },
     }
@@ -277,9 +365,9 @@ def run_physician(
         else "std"
     )
     secondary_label = (
-        "Model Predictive Entropy"
+        "Predictive Uncertainty"
         if secondary_key == "predictive_entropy"
-        else "Model Uncertainty Std"
+        else "Standard Deviation"
     )
 
     col = physician_column(clinical_df)
@@ -292,7 +380,7 @@ def run_physician(
     eval_df = pd.DataFrame(
         {
             "Image Data ID": evaluation.image_ids.astype(int),
-            "model_confidence": evaluation.uncertainty_confidence,
+            "model_probability": evaluation.uncertainty_confidence,
             "model_std": evaluation.uncertainty_std,
             "model_prob": evaluation.y_prob,
         }
@@ -304,7 +392,7 @@ def run_physician(
 
     grouped_rows: list[dict[str, Any]] = []
     uncertainty_specs = [
-        ("confidence", "model_confidence", "Model Confidence (2*|p-0.5|)"),
+        ("confidence", "model_probability", "Confidence"),
         (secondary_key, "model_std", secondary_label),
     ]
     ratings = [int(r) for r in sorted(pd.unique(merged[col]))]
@@ -341,7 +429,7 @@ def run_physician(
             tick_labels=[str(r) for r in ratings],
             x_label="Physician Confidence Rating (DXCONFID)",
             y_label=metric_label,
-            title=f"{metric_label} vs Physician Confidence ({evaluation.backend})",
+            title=f"{metric_label} by Physician Confidence Rating ({evaluation.backend})",
             output_path=plot_path,
         )
 
@@ -405,9 +493,9 @@ def run_longitudinal(
         else "std"
     )
     secondary_label = (
-        "Mean Model Predictive Entropy"
+        "Mean Predictive Uncertainty"
         if secondary_key == "predictive_entropy"
-        else "Mean Model Uncertainty Std"
+        else "Mean Standard Deviation"
     )
 
     required = ["Image Data ID", "PTID"]
@@ -469,7 +557,7 @@ def run_longitudinal(
             cohort = "stable_cn"
         elif unique_dx.issubset({"AD"}):
             cohort = "stable_ad"
-        elif first_dx == "CN" and "AD" in unique_dx and last_dx == "AD":
+        elif first_dx == "CN" and last_dx == "AD" and unique_dx.issubset({"CN", "AD"}):
             cohort = "cn_to_ad"
 
         patient_rows.append(
@@ -486,6 +574,10 @@ def run_longitudinal(
         )
 
     patient_df = pd.DataFrame(patient_rows)
+    patient_df = patient_df[
+        patient_df["cohort"].isin(["stable_cn", "stable_ad", "cn_to_ad"])
+    ].copy()
+
     table_path = output_dir / "longitudinal_patient_summary.csv"
     patient_df.to_csv(table_path, index=False)
 
@@ -504,7 +596,7 @@ def run_longitudinal(
 
     cohorts = ["stable_cn", "stable_ad", "cn_to_ad"]
     uncertainty_specs = [
-        ("confidence", "mean_confidence", "Mean Model Confidence"),
+        ("confidence", "mean_confidence", "Mean Confidence"),
         (secondary_key, "mean_std", secondary_label),
     ]
     plot_paths: dict[str, str] = {}
@@ -519,9 +611,9 @@ def run_longitudinal(
         save_boxplot(
             data=values,
             tick_labels=cohorts,
-            x_label="Cohort",
+            x_label="Longitudinal Cohort",
             y_label=metric_label,
-            title=f"Longitudinal Cohort {metric_label} ({evaluation.backend})",
+            title=f"Longitudinal Cohort Comparison: {metric_label} ({evaluation.backend})",
             output_path=plot_path,
         )
         plot_paths[metric_name] = str(plot_path)

+ 18 - 8
analysis/data_access.py

@@ -11,8 +11,6 @@ import pandas as pd
 import xarray as xr
 from bayesian_torch.utils.util import predictive_entropy
 
-from .uncertainty import confidence_certainty
-
 
 @dataclass
 class BackendEvaluation:
@@ -41,7 +39,7 @@ def _resolve_dataset_path(model_output_dir: Path) -> Path:
 def _positive_probability(
     predictions: xr.DataArray,
     class_index: int,
-) -> tuple[np.ndarray, np.ndarray, str]:
+) -> tuple[np.ndarray, np.ndarray, np.ndarray, str]:
     if "img_class" not in predictions.dims:
         raise ValueError("predictions is missing required dim: img_class")
 
@@ -51,10 +49,16 @@ def _positive_probability(
         )
 
     if "model" in predictions.dims:
-        prob_mean = predictions.mean(dim="model").isel(img_class=class_index).values
-        prob_std = predictions.std(dim="model").isel(img_class=class_index).values
+        class_probs = predictions.isel(img_class=class_index)
+        prob_mean = class_probs.mean(dim="model").values
+        # Confidence is defined as distance from 0.5 and averaged across models.
+        class_prob_arr = np.asarray(class_probs.values, dtype=float)
+        model_axis = class_probs.dims.index("model")
+        conf_mean = np.abs(class_prob_arr - 0.5).mean(axis=model_axis)
+        prob_std = class_probs.std(dim="model").values
         return (
             np.asarray(prob_mean, dtype=float),
+            np.asarray(conf_mean, dtype=float),
             np.asarray(prob_std, dtype=float),
             "std",
         )
@@ -62,7 +66,11 @@ def _positive_probability(
     sample_like = [d for d in predictions.dims if d in {"sample", "mc_sample", "draw"}]
     if sample_like:
         dim = str(sample_like[0])
-        prob_mean = predictions.mean(dim=dim).isel(img_class=class_index).values
+        class_probs = predictions.isel(img_class=class_index)
+        prob_mean = class_probs.mean(dim=dim).values
+        class_prob_arr = np.asarray(class_probs.values, dtype=float)
+        sample_axis = class_probs.dims.index(dim)
+        conf_mean = np.abs(class_prob_arr - 0.5).mean(axis=sample_axis)
 
         # For Bayesian MC predictions, uncertainty should come from predictive
         # entropy of the predictive distribution rather than classwise std.
@@ -70,13 +78,16 @@ def _positive_probability(
         entropy_uncertainty = predictive_entropy(np.asarray(mc_preds, dtype=float))
         return (
             np.asarray(prob_mean, dtype=float),
+            np.asarray(conf_mean, dtype=float),
             np.asarray(entropy_uncertainty, dtype=float),
             "predictive_entropy",
         )
 
     prob = predictions.isel(img_class=class_index).values
+    conf = np.abs(np.asarray(prob, dtype=float) - 0.5)
     return (
         np.asarray(prob, dtype=float),
+        np.asarray(conf, dtype=float),
         np.full_like(np.asarray(prob, dtype=float), np.nan),
         "unknown",
     )
@@ -129,10 +140,9 @@ def load_backend_evaluation(
         image_ids = np.arange(length)
 
     y_true = _labels_to_binary(labels, class_index=class_index)
-    y_prob, y_std, uncertainty_metric = _positive_probability(
+    y_prob, conf, y_std, uncertainty_metric = _positive_probability(
         predictions, class_index=class_index
     )
-    conf = confidence_certainty(y_prob)
 
     if len(y_true) != len(y_prob):
         raise ValueError(

+ 157 - 0
analysis/dataset_summary.py

@@ -0,0 +1,157 @@
+from __future__ import annotations
+
+from copy import deepcopy
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+from analysis.data_pipeline import build_dataset, build_dataset_splits
+
+
+def _percent(part: int, whole: int) -> float:
+    if whole <= 0:
+        return 0.0
+    return 100.0 * float(part) / float(whole)
+
+
+def compute_dataset_summary(
+    config: dict[str, Any],
+    root_dir: Path,
+    positive_class_index: int,
+) -> dict[str, Any]:
+    # Force CPU so summary generation works without GPU availability.
+    summary_config = deepcopy(config)
+    summary_config.setdefault("training", {})
+    summary_config["training"]["device"] = "cpu"
+
+    dataset, xls_file = build_dataset(summary_config, root_dir)
+    seed = int(config["data"]["seed"])
+    splits = build_dataset_splits(summary_config, dataset, xls_file, seed=seed)
+
+    split_names = ["train", "validation", "test"]
+    requested_ratios = [float(v) for v in config["data"]["data_splits"]]
+
+    if len(splits) != 3:
+        raise ValueError(f"Expected 3 dataset splits, got {len(splits)}.")
+
+    total_images = int(len(dataset))
+    labels = (dataset.expected_classes[:, positive_class_index] >= 0.5).int()
+
+    splits_summary: list[dict[str, Any]] = []
+    assigned = 0
+    assigned_positive = 0
+
+    for split_name, requested_ratio, subset in zip(
+        split_names,
+        requested_ratios,
+        splits,
+        strict=True,
+    ):
+        indices = list(subset.indices)
+        split_count = int(len(indices))
+        split_positive = int(labels[indices].sum().item()) if split_count > 0 else 0
+        split_negative = split_count - split_positive
+
+        assigned += split_count
+        assigned_positive += split_positive
+
+        splits_summary.append(
+            {
+                "split": split_name,
+                "requested_ratio": requested_ratio,
+                "image_count": split_count,
+                "image_pct_of_dataset": _percent(split_count, total_images),
+                "positive_count": split_positive,
+                "negative_count": split_negative,
+                "positive_pct_within_split": _percent(split_positive, split_count),
+                "negative_pct_within_split": _percent(split_negative, split_count),
+            }
+        )
+
+    if assigned != total_images:
+        raise ValueError(
+            f"Split coverage mismatch: assigned {assigned} images, expected {total_images}."
+        )
+
+    total_positive = assigned_positive
+    total_negative = total_images - total_positive
+
+    return {
+        "generated_utc": datetime.now(timezone.utc).isoformat(),
+        "seed": seed,
+        "positive_class_index": int(positive_class_index),
+        "totals": {
+            "image_count": total_images,
+            "positive_count": total_positive,
+            "negative_count": total_negative,
+            "positive_pct": _percent(total_positive, total_images),
+            "negative_pct": _percent(total_negative, total_images),
+        },
+        "splits": splits_summary,
+    }
+
+
+def _markdown_table(summary: dict[str, Any]) -> str:
+    lines = [
+        "| Split | Requested % | Images | Dataset % | Positive | Negative | Positive % (split) | Negative % (split) |",
+        "|---|---:|---:|---:|---:|---:|---:|---:|",
+    ]
+
+    for item in summary["splits"]:
+        lines.append(
+            "| "
+            f"{item['split'].title()} | "
+            f"{item['requested_ratio'] * 100.0:.2f}% | "
+            f"{item['image_count']} | "
+            f"{item['image_pct_of_dataset']:.2f}% | "
+            f"{item['positive_count']} | "
+            f"{item['negative_count']} | "
+            f"{item['positive_pct_within_split']:.2f}% | "
+            f"{item['negative_pct_within_split']:.2f}% |"
+        )
+
+    return "\n".join(lines)
+
+
+def write_dataset_summary_markdown(summary: dict[str, Any], path: Path) -> None:
+    totals = summary["totals"]
+    lines = [
+        "# Dataset Composition Summary",
+        "",
+        f"Generated (UTC): {summary['generated_utc']}",
+        f"Split seed: {summary['seed']}",
+        f"Positive class index: {summary['positive_class_index']}",
+        "",
+        "## Overall",
+        "",
+        f"- Total images: {totals['image_count']}",
+        f"- Positive images: {totals['positive_count']} ({totals['positive_pct']:.2f}%)",
+        f"- Negative images: {totals['negative_count']} ({totals['negative_pct']:.2f}%)",
+        "",
+        "## Train / Validation / Test Breakdown",
+        "",
+        _markdown_table(summary),
+        "",
+    ]
+    path.write_text("\n".join(lines), encoding="utf-8")
+
+
+def run_dataset_summary(
+    config: dict[str, Any],
+    root_dir: Path,
+    output_dir: Path,
+    positive_class_index: int,
+) -> dict[str, Any]:
+    summary = compute_dataset_summary(
+        config=config,
+        root_dir=root_dir,
+        positive_class_index=positive_class_index,
+    )
+
+    markdown_path = output_dir / "dataset_summary.md"
+    write_dataset_summary_markdown(summary, markdown_path)
+
+    return {
+        "summary_markdown": str(markdown_path),
+        "summary": summary,
+    }

+ 171 - 22
analysis/evaluate_models.py

@@ -14,6 +14,7 @@ from analysis.analysis_modules import (
     run_performance,
     run_physician,
 )
+from analysis.dataset_summary import run_dataset_summary
 from analysis.data_access import load_backend_evaluation, load_clinical_table
 from analysis.defaults import (
     DEFAULT_BACKENDS,
@@ -25,27 +26,43 @@ from analysis.defaults import (
     threshold_grid,
 )
 from analysis.holdout_evaluation import ensure_backend_netcdf
+from analysis.longitudinal_audit import run_longitudinal_breakdown_audit
+from analysis.noise_correlation import run_noise_accuracy_uncertainty_analysis
 from analysis.noise_analysis import run_noise_analysis
 from analysis.runtime import backend_dir, init_runtime_paths, load_config, write_json
 
 
 def _plot_description(filename: str) -> str:
     descriptions = {
-        "performance_threshold_sweep.png": "Accuracy and F1 as the decision threshold varies.",
-        "performance_uncertainty_cutoff.png": "Performance while progressively restricting to lower-uncertainty predictions.",
-        "performance_uncertainty_percentile_cutoff.png": "Percentile-ranked low-uncertainty subset performance from least to most restricted.",
+        "performance_threshold_accuracy.png": "Accuracy as the decision threshold varies.",
+        "performance_threshold_f1.png": "F1 score as the decision threshold varies.",
+        "performance_threshold_accuracy_f1.png": "Accuracy and F1 shown side-by-side as the decision threshold varies.",
+        "performance_uncertainty_cutoff_accuracy.png": "Accuracy while progressively restricting to higher-confidence and uncertainty-metric subsets.",
+        "performance_uncertainty_cutoff_f1.png": "F1 score while progressively restricting to higher-confidence and uncertainty-metric subsets.",
+        "performance_uncertainty_cutoff_accuracy_f1.png": "Accuracy and F1 shown side-by-side across uncertainty-cutoff restriction levels.",
+        "performance_uncertainty_percentile_cutoff_accuracy.png": "Accuracy from least to most restricted percentile-wise subset selection.",
+        "performance_uncertainty_percentile_cutoff_f1.png": "F1 score from least to most restricted percentile-wise subset selection.",
+        "performance_uncertainty_percentile_cutoff_accuracy_f1.png": "Accuracy and F1 shown side-by-side across percentile-floor restriction levels.",
         "calibration_reliability.png": "Reliability diagram comparing predicted probability to empirical outcome frequency.",
-        "physician_confidence_boxplot.png": "Model confidence grouped by physician confidence ratings.",
-        "physician_std_boxplot.png": "Model secondary uncertainty grouped by physician confidence ratings.",
-        "physician_predictive_entropy_boxplot.png": "Predictive entropy grouped by physician confidence ratings.",
-        "longitudinal_cohort_confidence.png": "Longitudinal cohort comparison using model confidence.",
-        "longitudinal_cohort_std.png": "Longitudinal cohort comparison using ensemble standard deviation uncertainty.",
-        "longitudinal_cohort_predictive_entropy.png": "Longitudinal cohort comparison using predictive entropy uncertainty.",
-        "noise_sensitivity.png": "Performance metrics across increasing Gaussian noise factors.",
-        "noise_uncertainty.png": "Uncertainty metrics across increasing Gaussian noise factors.",
-        "noise_confidence_certainty.png": "Confidence certainty trend across increasing Gaussian noise factors.",
-        "ensemble_noise_examples.png": "Representative image slices with progressively larger Gaussian noise factors.",
-        "bayesian_noise_examples.png": "Representative image slices with progressively larger Gaussian noise factors.",
+        "physician_confidence_boxplot.png": "Confidence grouped by physician confidence ratings.",
+        "physician_std_boxplot.png": "Standard deviation grouped by physician confidence ratings.",
+        "physician_predictive_entropy_boxplot.png": "Predictive uncertainty grouped by physician confidence ratings.",
+        "longitudinal_cohort_confidence.png": "Longitudinal cohort comparison using confidence.",
+        "longitudinal_cohort_std.png": "Longitudinal cohort comparison using standard deviation.",
+        "longitudinal_cohort_predictive_entropy.png": "Longitudinal cohort comparison using predictive uncertainty.",
+        "noise_sensitivity_accuracy.png": "Accuracy trend across increasing Gaussian noise factors.",
+        "noise_sensitivity_f1.png": "F1 trend across increasing Gaussian noise factors.",
+        "noise_sensitivity_accuracy_f1.png": "Accuracy and F1 shown side-by-side across increasing Gaussian noise factors.",
+        "noise_confidence.png": "Confidence trend across increasing Gaussian noise factors.",
+        "noise_standard_deviation.png": "Standard deviation trend across increasing Gaussian noise factors.",
+        "noise_confidence_standard_deviation.png": "Confidence and standard deviation shown side-by-side across increasing Gaussian noise factors.",
+        "noise_predictive_uncertainty.png": "Predictive uncertainty trend across increasing Gaussian noise factors.",
+        "noise_confidence_predictive_uncertainty.png": "Confidence and predictive uncertainty shown side-by-side across increasing Gaussian noise factors.",
+        "noise_accuracy_uncertainty_2d.png": "2D uncertainty-vs-accuracy relationship with linear fit (noise factor encoded by color).",
+        "ensemble_noise_examples.png": "Representative noisy image slices across selected Gaussian noise factors.",
+        "bayesian_noise_examples.png": "Representative noisy image slices across selected Gaussian noise factors.",
+        "ensemble_clean_scan_example.png": "Example clean scan image with no added noise.",
+        "bayesian_clean_scan_example.png": "Example clean scan image with no added noise.",
     }
     return descriptions.get(filename, "Generated analysis plot.")
 
@@ -101,7 +118,84 @@ def _parse_args() -> argparse.Namespace:
         action="store_true",
         help="Skip Gaussian noise sensitivity analysis.",
     )
-    return parser.parse_args()
+    parser.add_argument(
+        "--longitudinal-breakdown-only",
+        action="store_true",
+        help=(
+            "Run only longitudinal cohort breakdown audit from existing model "
+            "evaluation outputs (no full analysis rerun)."
+        ),
+    )
+    parser.add_argument(
+        "--noise-correlation-only",
+        action="store_true",
+        help=(
+            "Run only the noise uncertainty-vs-accuracy correlation/regression "
+            "analysis from an existing noise_sensitivity.csv per backend."
+        ),
+    )
+    parser.add_argument(
+        "--dataset-summary-only",
+        action="store_true",
+        help=(
+            "Generate only dataset composition summary documentation "
+            "(overall and train/validation/test class breakdown)."
+        ),
+    )
+
+    args = parser.parse_args()
+    only_modes = [
+        bool(args.longitudinal_breakdown_only),
+        bool(args.noise_correlation_only),
+        bool(args.dataset_summary_only),
+    ]
+    if sum(only_modes) > 1:
+        parser.error(
+            "Only one of --longitudinal-breakdown-only, "
+            "--noise-correlation-only, and --dataset-summary-only may be used at once."
+        )
+
+    return args
+
+
+def _run_longitudinal_breakdown_only(
+    config: dict[str, Any],
+    backend: str,
+    clinical_df: pd.DataFrame,
+    out_dir: Path,
+) -> dict[str, Any]:
+    evaluation = load_backend_evaluation(
+        config=config,
+        backend=backend,
+        class_index=DEFAULT_POSITIVE_CLASS_INDEX,
+    )
+    summary = run_longitudinal_breakdown_audit(
+        evaluation=evaluation,
+        clinical_df=clinical_df,
+        output_dir=out_dir,
+    )
+    write_json(out_dir / "longitudinal_breakdown_backend_summary.json", summary)
+    return summary
+
+
+def _run_noise_correlation_only(
+    backend: str,
+    out_dir: Path,
+) -> dict[str, Any]:
+    noise_table_path = out_dir / "noise_sensitivity.csv"
+    if not noise_table_path.exists():
+        raise FileNotFoundError(
+            f"Expected existing noise table for --noise-correlation-only: {noise_table_path}"
+        )
+
+    noise_df = pd.read_csv(noise_table_path)
+    summary = run_noise_accuracy_uncertainty_analysis(
+        noise_df=noise_df,
+        backend=backend,
+        output_dir=out_dir,
+    )
+    write_json(out_dir / "noise_accuracy_uncertainty_backend_summary.json", summary)
+    return summary
 
 
 def _run_backend(
@@ -158,6 +252,10 @@ def _run_backend(
 
     if skip_noise:
         summary["noise"] = {"skipped": True, "reason": "--skip-noise supplied"}
+        summary["noise_accuracy_uncertainty"] = {
+            "skipped": True,
+            "reason": "Noise analysis skipped, so no noise table available.",
+        }
     else:
         try:
             summary["noise"] = run_noise_analysis(
@@ -171,11 +269,25 @@ def _run_backend(
                 calibration_bins=DEFAULT_CALIBRATION_BINS,
                 bayesian_mc_passes=DEFAULT_BAYESIAN_MC_PASSES,
             )
+
+            noise_table_path = Path(str(summary["noise"]["table"]))
+            noise_df = pd.read_csv(noise_table_path)
+            summary["noise_accuracy_uncertainty"] = (
+                run_noise_accuracy_uncertainty_analysis(
+                    noise_df=noise_df,
+                    backend=backend,
+                    output_dir=out_dir,
+                )
+            )
         except Exception as exc:
             summary["noise"] = {
                 "skipped": True,
                 "reason": f"Noise analysis failed: {exc}",
             }
+            summary["noise_accuracy_uncertainty"] = {
+                "skipped": True,
+                "reason": f"Noise relationship analysis failed: {exc}",
+            }
 
     report_path = _write_backend_plot_report(backend=backend, out_dir=out_dir)
     summary["plots_report"] = str(report_path)
@@ -194,6 +306,19 @@ def main() -> None:
     manifest: dict[str, Any] = {
         "run_dir": str(paths.run_dir),
         "output_root": str(paths.output_root),
+        "mode": (
+            "dataset_summary_only"
+            if bool(args.dataset_summary_only)
+            else (
+                "longitudinal_breakdown_only"
+                if bool(args.longitudinal_breakdown_only)
+                else (
+                    "noise_correlation_only"
+                    if bool(args.noise_correlation_only)
+                    else "full"
+                )
+            )
+        ),
         "positive_class_index": DEFAULT_POSITIVE_CLASS_INDEX,
         "threshold_sweep": {
             "values": [float(v) for v in threshold_grid().tolist()],
@@ -205,16 +330,40 @@ def main() -> None:
         "backends": {},
     }
 
-    for backend in args.backend:
-        out_dir = backend_dir(paths, backend)
-        manifest["backends"][backend] = _run_backend(
+    if args.dataset_summary_only:
+        manifest["dataset_summary"] = run_dataset_summary(
             config=config,
             root_dir=paths.root_dir,
-            backend=backend,
-            clinical_df=clinical_df,
-            skip_noise=bool(args.skip_noise),
-            out_dir=out_dir,
+            output_dir=paths.run_dir,
+            positive_class_index=DEFAULT_POSITIVE_CLASS_INDEX,
         )
+        write_json(paths.run_dir / "run_manifest.json", manifest)
+        print(f"Dataset summary complete. Results saved to {paths.run_dir}")
+        return
+
+    for backend in args.backend:
+        out_dir = backend_dir(paths, backend)
+        if args.longitudinal_breakdown_only:
+            manifest["backends"][backend] = _run_longitudinal_breakdown_only(
+                config=config,
+                backend=backend,
+                clinical_df=clinical_df,
+                out_dir=out_dir,
+            )
+        elif args.noise_correlation_only:
+            manifest["backends"][backend] = _run_noise_correlation_only(
+                backend=backend,
+                out_dir=out_dir,
+            )
+        else:
+            manifest["backends"][backend] = _run_backend(
+                config=config,
+                root_dir=paths.root_dir,
+                backend=backend,
+                clinical_df=clinical_df,
+                skip_noise=bool(args.skip_noise),
+                out_dir=out_dir,
+            )
 
     write_json(paths.run_dir / "run_manifest.json", manifest)
     print(f"Analysis complete. Results saved to {paths.run_dir}")

+ 295 - 0
analysis/longitudinal_audit.py

@@ -0,0 +1,295 @@
+# pyright: basic
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pandas as pd
+
+from .data_access import BackendEvaluation
+from .runtime import write_json
+
+
+def _normalize_dx(value: Any) -> str:
+    if value is None or (isinstance(value, float) and np.isnan(value)):
+        return ""
+
+    v = str(value).strip().upper()
+    if v in {"NL", "NORMAL"}:
+        return "CN"
+    return v
+
+
+def _is_mci(value: str) -> bool:
+    return "MCI" in str(value).strip().upper()
+
+
+def _assign_cohort(diagnoses: list[str]) -> str:
+    if len(diagnoses) < 2:
+        return "insufficient_visits"
+
+    first_dx = diagnoses[0]
+    last_dx = diagnoses[-1]
+    unique_dx = set(diagnoses)
+
+    if unique_dx.issubset({"CN"}):
+        return "stable_cn"
+    if unique_dx.issubset({"AD"}):
+        return "stable_ad"
+    if all(_is_mci(dx) for dx in unique_dx):
+        return "stable_mci"
+    if first_dx == "CN" and "AD" in unique_dx and last_dx == "AD":
+        return "cn_to_ad"
+    if first_dx == "CN" and _is_mci(last_dx):
+        return "cn_to_mci"
+    if _is_mci(first_dx) and last_dx == "AD":
+        return "mci_to_ad"
+
+    return "other"
+
+
+def _prepare_clinical_longitudinal_table(clinical_df: pd.DataFrame) -> pd.DataFrame:
+    required = ["Image Data ID", "PTID"]
+    missing = [c for c in required if c not in clinical_df.columns]
+    if missing:
+        raise KeyError(f"Missing columns for longitudinal analysis: {missing}")
+
+    diagnosis_col = None
+    for candidate in ["Class", "DX", "Diagnosis"]:
+        if candidate in clinical_df.columns:
+            diagnosis_col = candidate
+            break
+    if diagnosis_col is None:
+        raise KeyError(
+            "No diagnosis column found. Expected one of: Class, DX, Diagnosis"
+        )
+
+    work = clinical_df[
+        ["Image Data ID", "PTID", diagnosis_col]
+        + [c for c in ["EXAMDATE"] if c in clinical_df.columns]
+    ].copy()
+
+    work["Image Data ID"] = pd.to_numeric(work["Image Data ID"], errors="coerce")
+    work = work.dropna(subset=["Image Data ID", "PTID"])
+    work["Image Data ID"] = work["Image Data ID"].astype(int)
+    work["PTID"] = work["PTID"].astype(str).str.strip()
+    work["diagnosis"] = work[diagnosis_col].map(_normalize_dx)
+
+    if "EXAMDATE" in work.columns:
+        work["EXAMDATE"] = pd.to_datetime(
+            work["EXAMDATE"],
+            errors="coerce",
+            format="mixed",
+        )
+        work = work.sort_values(["PTID", "EXAMDATE"], na_position="last")
+    else:
+        work = work.sort_values(["PTID", "Image Data ID"])
+
+    return work
+
+
+def _build_patient_table(work: pd.DataFrame) -> pd.DataFrame:
+    rows: list[dict[str, Any]] = []
+    for ptid, group in work.groupby("PTID"):
+        diagnoses = [d for d in group["diagnosis"].tolist() if d]
+        cohort = _assign_cohort(diagnoses)
+        first_dx = diagnoses[0] if diagnoses else ""
+        last_dx = diagnoses[-1] if diagnoses else ""
+        rows.append(
+            {
+                "PTID": ptid,
+                "n_visits": int(len(group)),
+                "n_labeled_visits": int(len(diagnoses)),
+                "first_dx": first_dx,
+                "last_dx": last_dx,
+                "cohort": cohort,
+            }
+        )
+
+    if not rows:
+        return pd.DataFrame(
+            columns=[
+                "PTID",
+                "n_visits",
+                "n_labeled_visits",
+                "first_dx",
+                "last_dx",
+                "cohort",
+            ]
+        )
+
+    return pd.DataFrame(rows)
+
+
+def _count_table(df: pd.DataFrame, column: str, count_name: str) -> pd.DataFrame:
+    if column not in df.columns or df.empty:
+        return pd.DataFrame({column: [], count_name: []})
+    return (
+        df.groupby(column, dropna=False)
+        .size()
+        .rename(count_name)
+        .reset_index()
+        .sort_values(count_name, ascending=False)
+    )
+
+
+def run_longitudinal_breakdown_audit(
+    evaluation: BackendEvaluation,
+    clinical_df: pd.DataFrame,
+    output_dir: Path,
+) -> dict[str, Any]:
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    full_table = _prepare_clinical_longitudinal_table(clinical_df)
+    evaluable_table = full_table[
+        full_table["Image Data ID"].isin(evaluation.image_ids.astype(int))
+    ].copy()
+
+    full_patient = _build_patient_table(full_table)
+    evaluable_patient = _build_patient_table(evaluable_table)
+
+    full_patient_min2 = full_patient[full_patient["n_labeled_visits"] >= 2].copy()
+    evaluable_patient_min2 = evaluable_patient[
+        evaluable_patient["n_labeled_visits"] >= 2
+    ].copy()
+
+    full_diagnosis_counts = _count_table(full_table, "diagnosis", "n_rows")
+    evaluable_diagnosis_counts = _count_table(evaluable_table, "diagnosis", "n_rows")
+
+    full_transition_counts = (
+        full_patient_min2.groupby(["first_dx", "last_dx"])
+        .size()
+        .rename("n_patients")
+        .reset_index()
+    )
+    full_transition_counts = full_transition_counts.sort_values(
+        "n_patients", ascending=False
+    )
+
+    evaluable_transition_counts = (
+        evaluable_patient_min2.groupby(["first_dx", "last_dx"])
+        .size()
+        .rename("n_patients")
+        .reset_index()
+    )
+    evaluable_transition_counts = evaluable_transition_counts.sort_values(
+        "n_patients", ascending=False
+    )
+
+    full_cohort_counts = _count_table(full_patient_min2, "cohort", "n_patients")
+    evaluable_cohort_counts = _count_table(
+        evaluable_patient_min2, "cohort", "n_patients"
+    )
+
+    paths = {
+        "full_diagnosis_counts": output_dir / "longitudinal_diagnosis_counts_all.csv",
+        "evaluable_diagnosis_counts": output_dir
+        / "longitudinal_diagnosis_counts_evaluable.csv",
+        "full_transition_counts": output_dir / "longitudinal_transition_counts_all.csv",
+        "evaluable_transition_counts": output_dir
+        / "longitudinal_transition_counts_evaluable.csv",
+        "full_cohort_counts": output_dir / "longitudinal_cohort_counts_all.csv",
+        "evaluable_cohort_counts": output_dir
+        / "longitudinal_cohort_counts_evaluable.csv",
+        "full_patient_table": output_dir / "longitudinal_patient_breakdown_all.csv",
+        "evaluable_patient_table": output_dir
+        / "longitudinal_patient_breakdown_evaluable.csv",
+        "summary_md": output_dir / "longitudinal_breakdown_summary.md",
+        "summary_json": output_dir / "longitudinal_breakdown_summary.json",
+    }
+
+    full_diagnosis_counts.to_csv(paths["full_diagnosis_counts"], index=False)
+    evaluable_diagnosis_counts.to_csv(paths["evaluable_diagnosis_counts"], index=False)
+    full_transition_counts.to_csv(paths["full_transition_counts"], index=False)
+    evaluable_transition_counts.to_csv(
+        paths["evaluable_transition_counts"], index=False
+    )
+    full_cohort_counts.to_csv(paths["full_cohort_counts"], index=False)
+    evaluable_cohort_counts.to_csv(paths["evaluable_cohort_counts"], index=False)
+    full_patient.to_csv(paths["full_patient_table"], index=False)
+    evaluable_patient.to_csv(paths["evaluable_patient_table"], index=False)
+
+    summary = {
+        "backend": evaluation.backend,
+        "n_evaluation_images": int(len(evaluation.image_ids)),
+        "n_clinical_rows_total": int(len(full_table)),
+        "n_clinical_rows_evaluable": int(len(evaluable_table)),
+        "n_patients_total": (
+            int(full_patient["PTID"].nunique()) if not full_patient.empty else 0
+        ),
+        "n_patients_evaluable": (
+            int(evaluable_patient["PTID"].nunique())
+            if not evaluable_patient.empty
+            else 0
+        ),
+        "n_patients_total_min2_visits": int(len(full_patient_min2)),
+        "n_patients_evaluable_min2_visits": int(len(evaluable_patient_min2)),
+        "n_patients_evaluable_single_visit_only": (
+            int((evaluable_patient["n_labeled_visits"] < 2).sum())
+            if not evaluable_patient.empty
+            else 0
+        ),
+        "tables": {
+            "full_diagnosis_counts": str(paths["full_diagnosis_counts"]),
+            "evaluable_diagnosis_counts": str(paths["evaluable_diagnosis_counts"]),
+            "full_transition_counts": str(paths["full_transition_counts"]),
+            "evaluable_transition_counts": str(paths["evaluable_transition_counts"]),
+            "full_cohort_counts": str(paths["full_cohort_counts"]),
+            "evaluable_cohort_counts": str(paths["evaluable_cohort_counts"]),
+            "full_patient_table": str(paths["full_patient_table"]),
+            "evaluable_patient_table": str(paths["evaluable_patient_table"]),
+        },
+    }
+    write_json(paths["summary_json"], summary)
+
+    cohort_text = "(none)"
+    if not evaluable_cohort_counts.empty:
+        cohort_text = "\n".join(
+            [
+                f"- {str(r['cohort'])}: {int(r['n_patients'])}"
+                for _, r in evaluable_cohort_counts.iterrows()
+            ]
+        )
+
+    lines = [
+        f"# Longitudinal Breakdown Audit ({evaluation.backend})",
+        "",
+        "This report explains the patient/cohort counts used by the longitudinal analysis.",
+        "",
+        "## Key Counts",
+        f"- Evaluation images available: {summary['n_evaluation_images']}",
+        f"- Clinical rows total: {summary['n_clinical_rows_total']}",
+        f"- Clinical rows overlapping evaluated images: {summary['n_clinical_rows_evaluable']}",
+        f"- Patients total (any visits): {summary['n_patients_total']}",
+        f"- Patients with overlapping evaluated images: {summary['n_patients_evaluable']}",
+        f"- Patients with >=2 labeled visits (full clinical): {summary['n_patients_total_min2_visits']}",
+        f"- Patients with >=2 labeled visits (evaluable overlap): {summary['n_patients_evaluable_min2_visits']}",
+        f"- Overlap patients with only 1 labeled visit: {summary['n_patients_evaluable_single_visit_only']}",
+        "",
+        "## Evaluable Cohort Counts (>=2 labeled visits)",
+        cohort_text,
+        "",
+        "## Interpretation",
+        "- The longitudinal plot uses only patients with at least two labeled visits after intersection with the evaluated image IDs.",
+        "- If MCI transition cohorts are missing, it typically means those patients do not have enough overlapping evaluated visits (or labels) in this run.",
+        "- Use the CSV tables below to inspect diagnosis and transition distributions in full clinical data versus evaluable overlap.",
+        "",
+        "## Output Tables",
+        f"- {paths['full_diagnosis_counts'].name}",
+        f"- {paths['evaluable_diagnosis_counts'].name}",
+        f"- {paths['full_transition_counts'].name}",
+        f"- {paths['evaluable_transition_counts'].name}",
+        f"- {paths['full_cohort_counts'].name}",
+        f"- {paths['evaluable_cohort_counts'].name}",
+        f"- {paths['full_patient_table'].name}",
+        f"- {paths['evaluable_patient_table'].name}",
+    ]
+    paths["summary_md"].write_text("\n".join(lines), encoding="utf-8")
+
+    return {
+        **summary,
+        "summary_markdown": str(paths["summary_md"]),
+        "summary_json": str(paths["summary_json"]),
+    }

+ 3 - 4
analysis/metrics.py

@@ -5,7 +5,9 @@ from __future__ import annotations
 import numpy as np
 
 
-def binary_confusion(y_true: np.ndarray, y_prob: np.ndarray, threshold: float) -> dict[str, int]:
+def binary_confusion(
+    y_true: np.ndarray, y_prob: np.ndarray, threshold: float
+) -> dict[str, int]:
     y_true_int = y_true.astype(int)
     y_pred = (y_prob >= threshold).astype(int)
 
@@ -71,7 +73,6 @@ def calibration_stats(
 
     edges = np.linspace(0.0, 1.0, bins + 1)
     bin_data: list[tuple[float, float, int]] = []
-    ece = 0.0
     mce = 0.0
 
     n = len(y_prob_f)
@@ -91,7 +92,6 @@ def calibration_stats(
         mean_conf = float(y_prob_f[mask].mean())
         frac_pos = float(y_true_int[mask].mean())
         gap = abs(frac_pos - mean_conf)
-        ece += (count / n) * gap
         mce = max(mce, gap)
 
         bin_data.append((mean_conf, frac_pos, count))
@@ -99,7 +99,6 @@ def calibration_stats(
     brier = float(np.mean((y_prob_f - y_true_int) ** 2))
 
     summary = {
-        "ece": float(ece),
         "mce": float(mce),
         "brier": brier,
         "bins": float(bins),

+ 142 - 51
analysis/noise_analysis.py

@@ -12,17 +12,17 @@ from bayesian_torch.utils.util import predictive_entropy
 
 from model.cnn import CNN3D
 
-from .data_pipeline import build_dataset_and_test_loader
+from .data_pipeline import build_holdout_loader
 from .metrics import calibration_stats, performance_at_threshold
 from .model_utils import configure_bayesian_sampling_mode
 from .plotting import (
     plots_dir,
+    save_clean_scan_image,
     save_noise_example_grid,
+    save_metric_pair_plot,
     save_noise_metrics_plot,
 )
 from .runtime import write_json
-from .uncertainty import confidence_certainty, confidence_uncertainty
-
 
 
 def _apply_scaled_noise(
@@ -121,12 +121,13 @@ def _infer_with_noise_ensemble(
     sigma: float,
     intensity_range: float,
     class_index: int,
-) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
     if not models:
         raise ValueError("No ensemble models were provided for noise inference")
 
     device = next(models[0].parameters()).device
     all_probs: list[float] = []
+    all_confidence: list[float] = []
     all_stds: list[float] = []
     all_true: list[int] = []
 
@@ -143,14 +144,21 @@ def _infer_with_noise_ensemble(
 
             pred_mat = np.stack(preds, axis=0)
             mean = pred_mat.mean(axis=0)
+            confidence = np.abs(pred_mat - 0.5).mean(axis=0)
             std = pred_mat.std(axis=0)
             true = labels_device[:, class_index].detach().cpu().numpy().astype(int)
 
             all_probs.extend(mean.tolist())
+            all_confidence.extend(confidence.tolist())
             all_stds.extend(std.tolist())
             all_true.extend(true.tolist())
 
-    return np.asarray(all_true), np.asarray(all_probs), np.asarray(all_stds)
+    return (
+        np.asarray(all_true),
+        np.asarray(all_probs),
+        np.asarray(all_confidence),
+        np.asarray(all_stds),
+    )
 
 
 def _infer_with_noise_bayesian(
@@ -160,9 +168,10 @@ def _infer_with_noise_bayesian(
     intensity_range: float,
     class_index: int,
     mc_passes: int,
-) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
     device = next(model.parameters()).device
     all_probs: list[float] = []
+    all_confidence: list[float] = []
     all_stds: list[float] = []
     all_true: list[int] = []
 
@@ -179,14 +188,21 @@ def _infer_with_noise_bayesian(
 
             draw_mat = np.stack(draws, axis=0)  # (mc_passes, batch, classes)
             mean = draw_mat.mean(axis=0)[:, class_index]
+            confidence = np.abs(draw_mat[:, :, class_index] - 0.5).mean(axis=0)
             entropy_uncertainty = predictive_entropy(draw_mat)
             true = labels_device[:, class_index].detach().cpu().numpy().astype(int)
 
             all_probs.extend(mean.tolist())
+            all_confidence.extend(np.asarray(confidence, dtype=float).tolist())
             all_stds.extend(np.asarray(entropy_uncertainty, dtype=float).tolist())
             all_true.extend(true.tolist())
 
-    return np.asarray(all_true), np.asarray(all_probs), np.asarray(all_stds)
+    return (
+        np.asarray(all_true),
+        np.asarray(all_probs),
+        np.asarray(all_confidence),
+        np.asarray(all_stds),
+    )
 
 
 def run_noise_analysis(
@@ -201,7 +217,7 @@ def run_noise_analysis(
     bayesian_mc_passes: int,
 ) -> dict[str, Any]:
     noise_sigmas = _uniform_sigma_schedule(noise_sigmas)
-    dataset, test_loader = build_dataset_and_test_loader(
+    test_loader = build_holdout_loader(
         config=config,
         root_dir=root_dir,
         seed=int(config["data"]["seed"]),
@@ -222,7 +238,7 @@ def run_noise_analysis(
         models = _load_ensemble_models(config)
         example_rows: list[tuple[float, torch.Tensor]] = []
         for sigma in noise_sigmas:
-            y_true, y_prob, y_std = _infer_with_noise_ensemble(
+            y_true, y_prob, y_confidence, y_std = _infer_with_noise_ensemble(
                 test_loader,
                 models,
                 sigma,
@@ -237,14 +253,9 @@ def run_noise_analysis(
                     "noise_factor": float(sigma),
                     "accuracy": float(perf["accuracy"]),
                     "f1": float(perf["f1"]),
-                    "ece": float(cal["ece"]),
                     "mce": float(cal["mce"]),
-                    "mean_confidence_certainty": float(
-                        np.nanmean(confidence_certainty(y_prob))
-                    ),
-                    "mean_confidence_uncertainty": float(
-                        np.nanmean(confidence_uncertainty(y_prob))
-                    ),
+                    "mean_confidence": float(np.nanmean(y_confidence)),
+                    "mean_model_output_probability": float(np.nanmean(y_prob)),
                     "mean_std": float(np.nanmean(y_std)),
                     "mean_predictive_entropy": float("nan"),
                     "mri_intensity_range": float(intensity_range),
@@ -265,12 +276,18 @@ def run_noise_analysis(
             noisy_by_sigma=example_rows,
             output_path=examples_dir / f"{backend}_noise_examples.png",
             title=f"{backend.title()} Noise Examples",
+            max_images=9,
+            n_rows=2,
+        )
+        save_clean_scan_image(
+            original_mri=original_mri,
+            output_path=examples_dir / f"{backend}_clean_scan_example.png",
         )
     elif backend == "bayesian":
         model = _load_bayesian_model(config)
         example_rows = []
         for sigma in noise_sigmas:
-            y_true, y_prob, y_std = _infer_with_noise_bayesian(
+            y_true, y_prob, y_confidence, y_std = _infer_with_noise_bayesian(
                 test_loader,
                 model,
                 sigma,
@@ -286,14 +303,9 @@ def run_noise_analysis(
                     "noise_factor": float(sigma),
                     "accuracy": float(perf["accuracy"]),
                     "f1": float(perf["f1"]),
-                    "ece": float(cal["ece"]),
                     "mce": float(cal["mce"]),
-                    "mean_confidence_certainty": float(
-                        np.nanmean(confidence_certainty(y_prob))
-                    ),
-                    "mean_confidence_uncertainty": float(
-                        np.nanmean(confidence_uncertainty(y_prob))
-                    ),
+                    "mean_confidence": float(np.nanmean(y_confidence)),
+                    "mean_model_output_probability": float(np.nanmean(y_prob)),
                     # Compatibility field name retained for downstream code.
                     "mean_std": float(np.nanmean(y_std)),
                     "mean_predictive_entropy": float(np.nanmean(y_std)),
@@ -315,6 +327,12 @@ def run_noise_analysis(
             noisy_by_sigma=example_rows,
             output_path=examples_dir / f"{backend}_noise_examples.png",
             title=f"{backend.title()} Noise Examples",
+            max_images=9,
+            n_rows=2,
+        )
+        save_clean_scan_image(
+            original_mri=original_mri,
+            output_path=examples_dir / f"{backend}_clean_scan_example.png",
         )
     else:
         raise ValueError(f"Unsupported backend for noise analysis: {backend}")
@@ -323,49 +341,122 @@ def run_noise_analysis(
     csv_path = output_dir / "noise_sensitivity.csv"
     df.to_csv(csv_path, index=False)
 
-    plot_path = out_plots_dir / "noise_sensitivity.png"
-    uncertainty_plot_path = out_plots_dir / "noise_uncertainty.png"
-    certainty_plot_path = out_plots_dir / "noise_confidence_certainty.png"
+    accuracy_plot_path = out_plots_dir / "noise_sensitivity_accuracy.png"
+    f1_plot_path = out_plots_dir / "noise_sensitivity_f1.png"
+    pair_plot_path = out_plots_dir / "noise_sensitivity_accuracy_f1.png"
+    confidence_plot_path = out_plots_dir / "noise_confidence.png"
+    confidence_uncertainty_pair_path = (
+        out_plots_dir / "noise_confidence_predictive_uncertainty.png"
+        if backend == "bayesian"
+        else out_plots_dir / "noise_confidence_standard_deviation.png"
+    )
+    secondary_plot_name = (
+        "noise_predictive_uncertainty.png"
+        if backend == "bayesian"
+        else "noise_standard_deviation.png"
+    )
+    secondary_plot_path = out_plots_dir / secondary_plot_name
 
     save_noise_metrics_plot(
         x=df["noise_factor"],
-        y_by_label=[
-            (df["accuracy"], "o", "accuracy"),
-            (df["f1"], "s", "f1"),
-            (df["ece"], "^", "ece"),
-        ],
+        y=df["accuracy"],
+        legend_label="Accuracy",
+        marker="o",
+        x_label="Gaussian Noise Factor",
+        y_label="Accuracy",
+        title=f"Accuracy vs Noise ({backend})",
+        output_path=accuracy_plot_path,
+        plot_key="noise_sensitivity_accuracy",
+    )
+    save_noise_metrics_plot(
+        x=df["noise_factor"],
+        y=df["f1"],
+        legend_label="F1",
+        marker="s",
+        x_label="Gaussian Noise Factor",
+        y_label="F1",
+        title=f"F1 vs Noise ({backend})",
+        output_path=f1_plot_path,
+        plot_key="noise_sensitivity_f1",
+    )
+    save_metric_pair_plot(
+        x=df["noise_factor"],
+        left_y=df["accuracy"],
+        right_y=df["f1"],
+        left_label="Accuracy",
+        right_label="F1",
         x_label="Gaussian Noise Factor",
-        y_label="Score",
-        title=f"Noise Sensitivity ({backend})",
-        output_path=plot_path,
+        y_label="Accuracy/F1",
+        title=f"Accuracy and F1 vs Noise ({backend})",
+        output_path=pair_plot_path,
+        plot_key="noise_sensitivity_accuracy_f1",
     )
+
+    secondary_label = (
+        "Predictive Uncertainty" if backend == "bayesian" else "Standard Deviation"
+    )
+
     save_noise_metrics_plot(
         x=df["noise_factor"],
-        y_by_label=[
-            (df["mean_confidence_uncertainty"], "o", "confidence_uncertainty"),
-            (df["mean_std"], "s", "std_uncertainty"),
-        ],
+        y=df["mean_confidence"],
+        legend_label="Confidence",
+        marker="o",
         x_label="Gaussian Noise Factor",
-        y_label="Uncertainty",
-        title=f"Uncertainty vs Noise ({backend})",
-        output_path=uncertainty_plot_path,
+        y_label="Confidence",
+        title=f"Confidence vs Noise ({backend})",
+        output_path=confidence_plot_path,
+        plot_key="noise_confidence",
     )
     save_noise_metrics_plot(
         x=df["noise_factor"],
-        y_by_label=[
-            (df["mean_confidence_certainty"], "o", "confidence_certainty"),
-        ],
+        y=df["mean_std"],
+        legend_label=secondary_label,
+        marker="^",
+        x_label="Gaussian Noise Factor",
+        y_label=secondary_label,
+        title=f"{secondary_label} vs Noise ({backend})",
+        output_path=secondary_plot_path,
+        plot_key=(
+            "noise_predictive_uncertainty"
+            if backend == "bayesian"
+            else "noise_standard_deviation"
+        ),
+    )
+    save_metric_pair_plot(
+        x=df["noise_factor"],
+        left_y=df["mean_confidence"],
+        right_y=df["mean_std"],
+        left_label="Confidence",
+        right_label=secondary_label,
         x_label="Gaussian Noise Factor",
-        y_label="Certainty",
-        title=f"Confidence Certainty vs Noise ({backend})",
-        output_path=certainty_plot_path,
+        y_label="Confidence/Uncertainty",
+        title=f"Confidence and {secondary_label} vs Noise ({backend})",
+        output_path=confidence_uncertainty_pair_path,
+        plot_key=(
+            "noise_confidence_predictive_uncertainty"
+            if backend == "bayesian"
+            else "noise_confidence_standard_deviation"
+        ),
     )
 
     out = {
         "table": str(csv_path),
-        "plot": str(plot_path),
-        "uncertainty_plot": str(uncertainty_plot_path),
-        "certainty_plot": str(certainty_plot_path),
+        "plots": {
+            "accuracy": str(accuracy_plot_path),
+            "f1": str(f1_plot_path),
+            "accuracy_f1": str(pair_plot_path),
+            "confidence": str(confidence_plot_path),
+            (
+                "confidence_predictive_uncertainty"
+                if backend == "bayesian"
+                else "confidence_standard_deviation"
+            ): str(confidence_uncertainty_pair_path),
+            (
+                "predictive_uncertainty"
+                if backend == "bayesian"
+                else "standard_deviation"
+            ): str(secondary_plot_path),
+        },
         "noise_factors": noise_sigmas,
         "noise_sigmas": noise_sigmas,
         # "mri_intensity_min": float(intensity_min),

+ 172 - 0
analysis/noise_correlation.py

@@ -0,0 +1,172 @@
+# pyright: basic
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+from scipy import stats
+
+from .plotting import annotate_stats_box, plots_dir
+from .runtime import write_json
+
+
+def _fit_line(metric: np.ndarray, accuracy: np.ndarray) -> dict[str, float]:
+    x = np.asarray(metric, dtype=float)
+    y = np.asarray(accuracy, dtype=float)
+    if len(y) < 3:
+        raise ValueError("Need at least 3 points for linear regression")
+
+    reg = stats.linregress(x, y)
+    return {
+        "intercept": float(reg.intercept),
+        "slope": float(reg.slope),
+        "r_value": float(reg.rvalue),
+        "p_value": float(reg.pvalue),
+        "stderr": float(reg.stderr),
+        "r_squared": float(reg.rvalue**2),
+    }
+
+
+def _metric_specs_for_backend(backend: str, df: pd.DataFrame) -> list[tuple[str, str]]:
+    specs: list[tuple[str, str]] = []
+    if "mean_confidence" in df.columns:
+        specs.append(("mean_confidence", "Confidence"))
+    elif "mean_model_output_probability" in df.columns:
+        # Backward-compatibility fallback for older CSV outputs.
+        specs.append(("mean_model_output_probability", "Confidence"))
+
+    if backend == "bayesian" and "mean_predictive_entropy" in df.columns:
+        specs.append(("mean_predictive_entropy", "Predictive Uncertainty"))
+    elif "mean_std" in df.columns:
+        specs.append(("mean_std", "Standard Deviation"))
+
+    return specs
+
+
+def run_noise_accuracy_uncertainty_analysis(
+    noise_df: pd.DataFrame,
+    backend: str,
+    output_dir: Path,
+) -> dict[str, Any]:
+    required_cols = ["noise_factor", "accuracy"]
+    missing = [c for c in required_cols if c not in noise_df.columns]
+    if missing:
+        raise KeyError(f"Missing required columns in noise dataframe: {missing}")
+
+    metric_specs = _metric_specs_for_backend(backend, noise_df)
+    if not metric_specs:
+        raise ValueError("No uncertainty metrics available for correlation analysis")
+
+    plot_path = plots_dir(output_dir) / "noise_accuracy_uncertainty_2d.png"
+    stats_rows: list[dict[str, Any]] = []
+
+    fig, axes = plt.subplots(1, len(metric_specs), figsize=(8 * len(metric_specs), 6))
+    if len(metric_specs) == 1:
+        axes = np.asarray([axes])
+
+    for idx, (metric_col, metric_label) in enumerate(metric_specs, start=1):
+        if metric_col not in noise_df.columns:
+            continue
+
+        df = noise_df[["noise_factor", metric_col, "accuracy"]].copy()
+        df = df.replace([np.inf, -np.inf], np.nan).dropna()
+        if len(df) < 4:
+            continue
+
+        noise = np.asarray(df["noise_factor"], dtype=float)
+        metric = np.asarray(df[metric_col], dtype=float)
+        accuracy = np.asarray(df["accuracy"], dtype=float)
+
+        pearson = stats.pearsonr(metric, accuracy)
+        fit = _fit_line(metric=metric, accuracy=accuracy)
+
+        stats_rows.append(
+            {
+                "backend": backend,
+                "metric_column": metric_col,
+                "metric_label": metric_label,
+                "n_points": int(len(df)),
+                "pearson_r_metric_vs_accuracy": float(pearson.statistic),
+                "p_value_metric_vs_accuracy": float(pearson.pvalue),
+                "regression_intercept": float(fit["intercept"]),
+                "regression_slope": float(fit["slope"]),
+                "regression_slope_stderr": float(fit["stderr"]),
+                "regression_r_squared": float(fit["r_squared"]),
+            }
+        )
+
+        ax = axes[idx - 1]
+        scatter = ax.scatter(
+            metric,
+            accuracy,
+            c=noise,
+            cmap="viridis",
+            s=42,
+            edgecolors="none",
+        )
+
+        x_line = np.linspace(float(np.min(metric)), float(np.max(metric)), num=200)
+        y_line = fit["intercept"] + fit["slope"] * x_line
+        ax.plot(x_line, y_line, color="#1f77b4", linewidth=2.0, label="Linear fit")
+
+        ax.set_xlabel(metric_label)
+        ax.set_ylabel("Accuracy")
+        ax.set_title(f"{backend.title()} - {metric_label} vs Accuracy")
+        ax.grid(True, alpha=0.3)
+        ax.legend()
+        annotate_stats_box(
+            ax,
+            lines=[
+                f"Pearson r = {pearson.statistic:.3f}",
+                f"p-value = {pearson.pvalue:.3g}",
+                f"R^2 = {fit['r_squared']:.3f}",
+            ],
+            location="upper left",
+        )
+        cbar = fig.colorbar(scatter, ax=ax)
+        cbar.set_label("Noise Factor")
+
+    fig.tight_layout()
+    plot_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(plot_path)
+    plt.close(fig)
+
+    stats_df = pd.DataFrame(stats_rows)
+    stats_csv = output_dir / "noise_accuracy_uncertainty_stats.csv"
+    stats_df.to_csv(stats_csv, index=False)
+
+    summary_md = output_dir / "noise_accuracy_uncertainty_summary.md"
+    lines = [
+        f"# Noise Accuracy-Uncertainty Analysis ({backend})",
+        "",
+        "This analysis collapses the noise axis and fits a 2D linear relationship between uncertainty metric and accuracy.",
+        "Noise factor is encoded as point color in the plot.",
+        "",
+        "## Metrics",
+    ]
+    if stats_df.empty:
+        lines.append("- No valid metric rows were available for regression.")
+    else:
+        for _, row in stats_df.iterrows():
+            lines.extend(
+                [
+                    f"- {row['metric_label']}: Pearson r={row['pearson_r_metric_vs_accuracy']:.4f}, p={row['p_value_metric_vs_accuracy']:.4g}",
+                    f"  - Regression R^2: {row['regression_r_squared']:.4f}",
+                ]
+            )
+
+    summary_md.write_text("\n".join(lines), encoding="utf-8")
+
+    payload = {
+        "backend": backend,
+        "plot": str(plot_path),
+        "table": str(stats_csv),
+        "summary_markdown": str(summary_md),
+        "rows": int(len(stats_df)),
+    }
+    write_json(output_dir / "noise_accuracy_uncertainty_summary.json", payload)
+    return payload

+ 395 - 38
analysis/plotting.py

@@ -8,6 +8,120 @@ import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
 import torch
+from matplotlib.axes import Axes
+
+
+# Easily editable plot text overrides by plot key.
+# Example:
+# "performance_threshold": {
+#     "title": "Custom Title",
+#     "x_label": "Custom X",
+#     "y_label": "Custom Y",
+# }
+# Common keys:
+# - performance_threshold_accuracy
+# - performance_threshold_f1
+# - performance_uncertainty_cutoff_accuracy
+# - performance_uncertainty_cutoff_f1
+# - performance_uncertainty_percentile_cutoff_accuracy
+# - performance_uncertainty_percentile_cutoff_f1
+# - calibration_reliability
+# - noise_sensitivity_accuracy
+# - noise_sensitivity_f1
+# - noise_confidence
+# - noise_standard_deviation
+# - noise_predictive_uncertainty
+# - boxplot
+PLOT_TEXT_OVERRIDES: dict[str, dict[str, str]] = {}
+
+
+def _resolve_plot_text(
+    plot_key: str,
+    default_title: str,
+    default_x_label: str,
+    default_y_label: str,
+) -> tuple[str, str, str]:
+    override = PLOT_TEXT_OVERRIDES.get(plot_key, {})
+    return (
+        override.get("title", default_title),
+        override.get("x_label", default_x_label),
+        override.get("y_label", default_y_label),
+    )
+
+
+def annotate_stats_box(
+    ax: Axes,
+    lines: list[str],
+    location: str = "upper left",
+) -> None:
+    if not lines:
+        return
+
+    locations: dict[str, tuple[float, float, str, str]] = {
+        "upper left": (0.02, 0.98, "left", "top"),
+        "upper right": (0.98, 0.98, "right", "top"),
+        "lower left": (0.02, 0.02, "left", "bottom"),
+        "lower right": (0.98, 0.02, "right", "bottom"),
+    }
+    x, y, ha, va = locations.get(location, locations["upper left"])
+    ax.text(
+        x,
+        y,
+        "\n".join(lines),
+        transform=ax.transAxes,
+        ha=ha,
+        va=va,
+        fontsize=10,
+        bbox={
+            "boxstyle": "round,pad=0.35",
+            "facecolor": "white",
+            "edgecolor": "#555555",
+            "alpha": 0.9,
+        },
+    )
+
+
+def _plot_correct_incorrect_bars(
+    ax: Axes,
+    x_values: pd.Series,
+    n_correct: pd.Series,
+    n_incorrect: pd.Series,
+) -> None:
+    x = np.asarray(x_values, dtype=float)
+    correct = np.asarray(n_correct, dtype=float)
+    incorrect = np.asarray(n_incorrect, dtype=float)
+    if x.size == 0 or correct.size == 0 or incorrect.size == 0:
+        return
+
+    width = float(np.diff(np.sort(x)).min()) * 0.8 if x.size > 1 else 0.04
+    max_count = float(max(np.nanmax(correct), np.nanmax(incorrect), 1.0))
+
+    bars_ax = ax.twinx()
+    bars_ax.patch.set_alpha(0.0)
+    bars_ax.bar(
+        x,
+        correct,
+        width=width,
+        color="#2ca02c",
+        alpha=0.2,
+        label="correct",
+        zorder=0,
+        align="center",
+    )
+    bars_ax.bar(
+        x,
+        -incorrect,
+        width=width,
+        color="#d62728",
+        alpha=0.2,
+        label="incorrect",
+        zorder=0,
+        align="center",
+    )
+    bars_ax.axhline(0.0, color="gray", linewidth=0.8, alpha=0.4)
+    bars_ax.set_ylim(-1.15 * max_count, 1.15 * max_count)
+    bars_ax.set_yticks([])
+    bars_ax.grid(False)
 
 
 def plots_dir(output_dir: Path) -> Path:
@@ -17,14 +131,33 @@ def plots_dir(output_dir: Path) -> Path:
 
 
 def save_performance_threshold_plot(
-    df: pd.DataFrame, backend: str, output_path: Path
+    df: pd.DataFrame,
+    backend: str,
+    output_path: Path,
+    metric_column: str,
+    metric_label: str,
+    plot_key: str,
 ) -> None:
+    title, x_label, y_label = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=f"{metric_label} vs Decision Threshold ({backend})",
+        default_x_label="Decision Threshold",
+        default_y_label=metric_label,
+    )
+
+    n_correct = pd.to_numeric(df["tp"], errors="coerce") + pd.to_numeric(
+        df["tn"], errors="coerce"
+    )
+    n_incorrect = pd.to_numeric(df["fp"], errors="coerce") + pd.to_numeric(
+        df["fn"], errors="coerce"
+    )
+
     fig, ax = plt.subplots(figsize=(10, 5))
-    ax.plot(df["threshold"], df["accuracy"], label="accuracy", marker="o")
-    ax.plot(df["threshold"], df["f1"], label="f1", marker="s")
-    ax.set_xlabel("Threshold")
-    ax.set_ylabel("Score")
-    ax.set_title(f"Performance vs Threshold ({backend})")
+    _plot_correct_incorrect_bars(ax, df["threshold"], n_correct, n_incorrect)
+    ax.plot(df["threshold"], df[metric_column], label=metric_label, marker="o")
+    ax.set_xlabel(x_label)
+    ax.set_ylabel(y_label)
+    ax.set_title(title)
     ax.grid(True, alpha=0.3)
     ax.legend()
     fig.tight_layout()
@@ -33,13 +166,134 @@ def save_performance_threshold_plot(
     plt.close(fig)
 
 
+def save_performance_threshold_pair_plot(
+    df: pd.DataFrame,
+    backend: str,
+    output_path: Path,
+    plot_key: str,
+) -> None:
+    title, x_label, _ = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=f"Accuracy and F1 vs Decision Threshold ({backend})",
+        default_x_label="Decision Threshold",
+        default_y_label="Accuracy/F1",
+    )
+
+    n_correct = pd.to_numeric(df["tp"], errors="coerce") + pd.to_numeric(
+        df["tn"], errors="coerce"
+    )
+    n_incorrect = pd.to_numeric(df["fp"], errors="coerce") + pd.to_numeric(
+        df["fn"], errors="coerce"
+    )
+
+    fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)
+    for ax, metric_col, metric_label, marker in [
+        (axes[0], "accuracy", "Accuracy", "o"),
+        (axes[1], "f1", "F1", "s"),
+    ]:
+        _plot_correct_incorrect_bars(ax, df["threshold"], n_correct, n_incorrect)
+        ax.plot(df["threshold"], df[metric_col], label=metric_label, marker=marker)
+        ax.set_xlabel(x_label)
+        ax.set_ylabel(metric_label)
+        ax.set_title(f"{metric_label}")
+        ax.grid(True, alpha=0.3)
+        ax.legend()
+
+    fig.suptitle(title)
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+
 def save_uncertainty_cutoff_plot(
     cutoff_df: pd.DataFrame,
     title_prefix: str,
     x_label: str,
     output_path: Path,
+    metric_column: str,
+    metric_label: str,
+    plot_key: str,
+) -> None:
+    title, x_label_final, y_label = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=f"{metric_label} vs {title_prefix}",
+        default_x_label=x_label,
+        default_y_label=metric_label,
+    )
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    first_group = (
+        cutoff_df.sort_values(["uncertainty_type", "restriction_level"])
+        .groupby("uncertainty_type", as_index=False)
+        .head(1)
+    )
+    if not first_group.empty:
+        # Draw count bars once; uncertainty lines are overlaid afterwards.
+        rep_name = str(first_group.iloc[0]["uncertainty_type"])
+        rep = cutoff_df[cutoff_df["uncertainty_type"] == rep_name].sort_values(
+            "restriction_level"
+        )
+        _plot_correct_incorrect_bars(
+            ax,
+            rep["restriction_level"],
+            pd.to_numeric(rep["n_correct"], errors="coerce"),
+            pd.to_numeric(rep["n_incorrect"], errors="coerce"),
+        )
+
+    for uncertainty_name, group in cutoff_df.groupby("uncertainty_type"):
+        g = group.sort_values("restriction_level")
+        ax.plot(
+            g["restriction_level"],
+            g[metric_column],
+            marker="o",
+            label=uncertainty_name,
+        )
+
+    ax.set_title(title)
+    ax.set_xlabel(x_label_final)
+    ax.set_ylabel(y_label)
+    ax.grid(True, alpha=0.3)
+    ax.legend()
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+
+def save_uncertainty_cutoff_pair_plot(
+    cutoff_df: pd.DataFrame,
+    title_prefix: str,
+    x_label: str,
+    output_path: Path,
+    plot_key: str,
 ) -> None:
+    title, x_label_final, _ = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=f"Accuracy and F1 vs {title_prefix}",
+        default_x_label=x_label,
+        default_y_label="Accuracy/F1",
+    )
+
     fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)
+    first_group = (
+        cutoff_df.sort_values(["uncertainty_type", "restriction_level"])
+        .groupby("uncertainty_type", as_index=False)
+        .head(1)
+    )
+    if not first_group.empty:
+        rep_name = str(first_group.iloc[0]["uncertainty_type"])
+        rep = cutoff_df[cutoff_df["uncertainty_type"] == rep_name].sort_values(
+            "restriction_level"
+        )
+        for ax in axes:
+            _plot_correct_incorrect_bars(
+                ax,
+                rep["restriction_level"],
+                pd.to_numeric(rep["n_correct"], errors="coerce"),
+                pd.to_numeric(rep["n_incorrect"], errors="coerce"),
+            )
+
     for uncertainty_name, group in cutoff_df.groupby("uncertainty_type"):
         g = group.sort_values("restriction_level")
         axes[0].plot(
@@ -49,14 +303,15 @@ def save_uncertainty_cutoff_plot(
             g["restriction_level"], g["f1"], marker="s", label=uncertainty_name
         )
 
-    axes[0].set_title(f"Accuracy vs {title_prefix}")
-    axes[1].set_title(f"F1 vs {title_prefix}")
-    for ax in axes:
-        ax.set_xlabel(x_label)
+    axes[0].set_title("Accuracy")
+    axes[1].set_title("F1")
+    for ax, metric_label in [(axes[0], "Accuracy"), (axes[1], "F1")]:
+        ax.set_xlabel(x_label_final)
+        ax.set_ylabel(metric_label)
         ax.grid(True, alpha=0.3)
         ax.legend()
-    axes[0].set_ylabel("Accuracy")
-    axes[1].set_ylabel("F1")
+
+    fig.suptitle(title)
     fig.tight_layout()
     output_path.parent.mkdir(parents=True, exist_ok=True)
     fig.savefig(output_path)
@@ -64,13 +319,20 @@ def save_uncertainty_cutoff_plot(
 
 
 def save_calibration_plot(per_bin: np.ndarray, backend: str, output_path: Path) -> None:
+    title, x_label, y_label = _resolve_plot_text(
+        plot_key="calibration_reliability",
+        default_title=f"Reliability Diagram ({backend})",
+        default_x_label="Mean Predicted Probability",
+        default_y_label="Empirical Fraction Positive",
+    )
+
     fig, ax = plt.subplots(figsize=(6, 6))
     valid = ~np.isnan(per_bin[:, 1])
     ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="ideal")
     ax.plot(per_bin[valid, 0], per_bin[valid, 1], marker="o", label=backend)
-    ax.set_xlabel("Mean Predicted Probability")
-    ax.set_ylabel("Empirical Fraction Positive")
-    ax.set_title(f"Reliability Diagram ({backend})")
+    ax.set_xlabel(x_label)
+    ax.set_ylabel(y_label)
+    ax.set_title(title)
     ax.legend()
     ax.grid(True, alpha=0.3)
     fig.tight_layout()
@@ -87,11 +349,23 @@ def save_boxplot(
     title: str,
     output_path: Path,
 ) -> None:
+    title_final, x_label_final, y_label_final = _resolve_plot_text(
+        plot_key="boxplot",
+        default_title=title,
+        default_x_label=x_label,
+        default_y_label=y_label,
+    )
+
+    labels_with_n = [
+        f"{label}\n(n={len(np.asarray(values, dtype=float))})"
+        for label, values in zip(tick_labels, data)
+    ]
+
     fig, ax = plt.subplots(figsize=(9, 5))
-    ax.boxplot(data, tick_labels=tick_labels)
-    ax.set_xlabel(x_label)
-    ax.set_ylabel(y_label)
-    ax.set_title(title)
+    ax.boxplot(data, tick_labels=labels_with_n)
+    ax.set_xlabel(x_label_final)
+    ax.set_ylabel(y_label_final)
+    ax.set_title(title_final)
     ax.grid(True, axis="y", alpha=0.3)
     fig.tight_layout()
     output_path.parent.mkdir(parents=True, exist_ok=True)
@@ -128,26 +402,49 @@ def save_noise_example_grid(
     noisy_by_sigma: list[tuple[float, torch.Tensor]],
     output_path: Path,
     title: str,
+    max_images: int = 9,
+    n_rows: int = 2,
 ) -> None:
     if not noisy_by_sigma:
         return
 
-    original_slice = _normalize_for_display(_central_slice(original_mri))
-    n_rows = len(noisy_by_sigma)
-    fig, axes = plt.subplots(n_rows, 2, figsize=(8, 3.2 * n_rows))
-    if n_rows == 1:
-        axes = np.array([axes])
+    n_total = len(noisy_by_sigma)
+    target = max(1, min(int(max_images), n_total))
+    if target >= n_total:
+        selected = noisy_by_sigma
+    else:
+        # Sample indices across the full tested range, including first and last.
+        raw_idx = np.linspace(0, n_total - 1, num=target)
+        idx = np.round(raw_idx).astype(int)
+        selected_indices = sorted(set(idx.tolist()))
+        if len(selected_indices) < target:
+            existing = set(selected_indices)
+            for i in range(n_total):
+                if i in existing:
+                    continue
+                selected_indices.append(i)
+                if len(selected_indices) >= target:
+                    break
+            selected_indices = sorted(selected_indices)
+
+        selected = [noisy_by_sigma[i] for i in selected_indices]
+
+    n_images = len(selected)
+    n_rows = max(1, int(n_rows))
+    n_cols = int(np.ceil(n_images / n_rows))
 
-    for row_idx, (sigma, noisy_tensor) in enumerate(noisy_by_sigma):
+    fig, axes = plt.subplots(n_rows, n_cols, figsize=(3.8 * n_cols, 3.2 * n_rows))
+    axes_flat = np.atleast_1d(axes).reshape(-1)
+
+    for idx, (sigma, noisy_tensor) in enumerate(selected):
+        ax = axes_flat[idx]
         noisy_slice = _normalize_for_display(_central_slice(noisy_tensor))
-        ax_orig, ax_noisy = axes[row_idx]
-        ax_orig.imshow(original_slice, cmap="gray")
-        ax_orig.set_title("Original")
-        ax_orig.axis("off")
+        ax.imshow(noisy_slice, cmap="gray")
+        ax.set_title(f"Noise factor={sigma:g}")
+        ax.axis("off")
 
-        ax_noisy.imshow(noisy_slice, cmap="gray")
-        ax_noisy.set_title(f"Noisy factor={sigma:g}")
-        ax_noisy.axis("off")
+    for idx in range(n_images, len(axes_flat)):
+        axes_flat[idx].axis("off")
 
     fig.suptitle(title)
     fig.tight_layout()
@@ -156,23 +453,83 @@ def save_noise_example_grid(
     plt.close(fig)
 
 
+def save_clean_scan_image(
+    original_mri: torch.Tensor,
+    output_path: Path,
+) -> None:
+    image = _normalize_for_display(_central_slice(original_mri))
+    fig, ax = plt.subplots(figsize=(4, 4))
+    ax.imshow(image, cmap="gray")
+    ax.axis("off")
+    fig.subplots_adjust(left=0, right=1, top=1, bottom=0)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path, bbox_inches="tight", pad_inches=0)
+    plt.close(fig)
+
+
 def save_noise_metrics_plot(
     x: pd.Series,
-    y_by_label: list[tuple[pd.Series, str, str]],
+    y: pd.Series,
+    legend_label: str,
+    marker: str,
     x_label: str,
     y_label: str,
     title: str,
     output_path: Path,
+    plot_key: str,
 ) -> None:
+    title_final, x_label_final, y_label_final = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=title,
+        default_x_label=x_label,
+        default_y_label=y_label,
+    )
+
     fig, ax = plt.subplots(figsize=(10, 5))
-    for series, marker, label in y_by_label:
-        ax.plot(x, series, marker=marker, label=label)
-    ax.set_xlabel(x_label)
-    ax.set_ylabel(y_label)
-    ax.set_title(title)
+    ax.plot(x, y, marker=marker, label=legend_label)
+    ax.set_xlabel(x_label_final)
+    ax.set_ylabel(y_label_final)
+    ax.set_title(title_final)
     ax.grid(True, alpha=0.3)
     ax.legend()
     fig.tight_layout()
     output_path.parent.mkdir(parents=True, exist_ok=True)
     fig.savefig(output_path)
     plt.close(fig)
+
+
+def save_metric_pair_plot(
+    x: pd.Series,
+    left_y: pd.Series,
+    right_y: pd.Series,
+    left_label: str,
+    right_label: str,
+    x_label: str,
+    y_label: str,
+    title: str,
+    output_path: Path,
+    plot_key: str,
+) -> None:
+    title_final, x_label_final, y_label_final = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=title,
+        default_x_label=x_label,
+        default_y_label=y_label,
+    )
+
+    fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)
+    axes[0].plot(x, left_y, marker="o", label=left_label)
+    axes[1].plot(x, right_y, marker="s", label=right_label)
+
+    for ax, name in [(axes[0], left_label), (axes[1], right_label)]:
+        ax.set_xlabel(x_label_final)
+        ax.set_ylabel(name)
+        ax.set_title(name)
+        ax.grid(True, alpha=0.3)
+        ax.legend()
+
+    fig.suptitle(title_final)
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)