1
0

4 Коммиты bd7ecee3c1 ... e79e7f50c2

Автор SHA1 Сообщение Дата
  Nicholas Schense e79e7f50c2 more analysis work - fixed graphs, going to implement more noise + stdev analysis 1 неделя назад
  Nicholas Schense 810cef2630 Currently working! 1 месяц назад
  Nicholas Schense 1cae771e45 Work on analysis (with AI) 2 месяцев назад
  Nicholas Schense 2e4c5b386d updates to analysis! 2 месяцев назад

+ 2 - 1
.gitignore

@@ -1,2 +1,3 @@
 .venv/
-__pycache__/
+__pycache__/
+analysis_output/

+ 124 - 0
analysis/ANALYSES_OVERVIEW.md

@@ -0,0 +1,124 @@
+# Analysis Overview
+
+This document summarizes what each analysis in this folder does and what it produces.
+
+## Runtime Defaults
+
+- Analysis defaults are centralized in `analysis/defaults.py`.
+- The orchestrator CLI in `analysis/evaluate_models.py` is intentionally minimal and focuses on operational controls (`--backend`, `--run-name`, `--skip-noise`).
+- Standalone operational modes include `--longitudinal-breakdown-only`, `--noise-correlation-only`, and `--dataset-summary-only`.
+- Threshold grids, uncertainty percentile grids, noise-factor grids, calibration bins, class index, decision threshold, and bayesian MC passes are sourced from `analysis/defaults.py`.
+
+## Shared Utilities
+
+- Shared loader/split logic is centralized in `analysis/data_pipeline.py`.
+- All plotting code is centralized in `analysis/plotting.py` for easier inspection and maintenance.
+
+## 1. Performance Threshold Sweep
+
+- Purpose: Measure classification performance as decision threshold changes.
+- Inputs: Ground-truth labels and predicted probabilities.
+- Method: Evaluate metrics across an evenly spaced threshold grid.
+- Main outputs:
+  - `performance_threshold_sweep.csv`
+  - `plots/performance_threshold_accuracy.png`
+  - `plots/performance_threshold_f1.png`
+
+## 2. Uncertainty Cutoff (Raw-Value)
+
+- Purpose: Evaluate performance on subsets with uncertainty below percentile-derived cutoff values.
+- Inputs: Uncertainty arrays (confidence-derived uncertainty and backend-specific uncertainty).
+- Method:
+  - Build evenly spaced percentile points.
+  - Convert each percentile to a raw uncertainty cutoff value.
+  - Keep samples where uncertainty is less than or equal to that cutoff.
+  - Compute accuracy and F1 for each retained subset.
+- Main outputs:
+  - `performance_uncertainty_cutoff.csv`
+  - `plots/performance_uncertainty_cutoff_accuracy.png`
+  - `plots/performance_uncertainty_cutoff_f1.png`
+
+## 3. Uncertainty Cutoff (Percentile-Ranked)
+
+- Purpose: Evaluate performance from all samples toward only the lowest-uncertainty samples.
+- Inputs: Same uncertainty arrays as above.
+- Method:
+  - Use an evenly spaced percentile grid.
+  - Keep samples where uncertainty is less than or equal to the selected percentile cutoff.
+  - Plot from least restricted on the left (all samples) to most restricted on the right (lowest-uncertainty subset only).
+- Main outputs:
+  - `performance_uncertainty_percentile_cutoff.csv`
+  - `plots/performance_uncertainty_percentile_cutoff_accuracy.png`
+  - `plots/performance_uncertainty_percentile_cutoff_f1.png`
+
+## 4. Calibration Analysis
+
+- Purpose: Quantify probability calibration quality.
+- Inputs: Ground-truth labels and predicted probabilities.
+- Method:
+  - Reliability binning with configurable bin count.
+  - Compute MCE and Brier score.
+- Main outputs:
+  - `calibration_bins.csv`
+  - `plots/calibration_reliability.png`
+
+## 5. Physician Confidence Comparison
+
+- Purpose: Compare model uncertainty with physician confidence ratings.
+- Inputs: Evaluation outputs plus clinical table (Image Data ID + physician confidence column).
+- Method:
+  - Merge by image ID.
+  - Group metrics by physician confidence level.
+  - Plot distributions per rating group.
+- Main outputs include grouped summary CSV files and boxplots for confidence and standard deviation (ensemble) or predictive uncertainty (bayesian).
+
+## 6. Longitudinal Stability Analysis
+
+- Purpose: Examine uncertainty patterns across stable and transitioning patient trajectories.
+- Inputs: Evaluation outputs and clinical timeline information.
+- Method:
+  - Build patient-level trajectories.
+  - Group by clinical cohort dynamics.
+  - Compare uncertainty summaries across cohorts.
+- Main outputs include patient/cohort summary CSV files and cohort uncertainty plots.
+
+## 7. Noise Sensitivity Analysis
+
+- Purpose: Test robustness and uncertainty behavior under synthetic Gaussian noise.
+- Inputs: Holdout data loader, model backend, noise factor schedule, threshold, calibration bins.
+- Method:
+  - Use an evenly spaced noise factor schedule.
+  - Add Gaussian noise scaled by a fixed intensity-range factor.
+  - Recompute performance and calibration at each sigma.
+  - Save visual examples of noised images.
+- Main outputs:
+  - `noise_sensitivity.csv`
+  - `plots/noise_sensitivity_accuracy.png`
+  - `plots/noise_sensitivity_f1.png`
+  - `plots/noise_confidence.png`
+  - `plots/noise_standard_deviation.png` (ensemble) / `plots/noise_predictive_uncertainty.png` (bayesian)
+  - `plots/noise_examples/*_noise_examples.png`
+  - `plots/noise_examples/*_clean_scan_example.png`
+
+## 8. Dataset Composition Summary
+
+- Purpose: Report dataset composition without rerunning full evaluation analyses.
+- Inputs: Raw dataset files and configured train/validation/test split ratios.
+- Method:
+  - Rebuild dataset and split assignment using the configured split seed.
+  - Count total images and positive/negative labels overall.
+  - Count train/validation/test images and per-split class balance.
+  - Compute both overall and split-level percentages.
+- Main outputs:
+  - `dataset_summary.md`
+- CLI:
+  - `python -m analysis.evaluate_models --dataset-summary-only`
+
+## Plot Report
+
+- Each backend output now includes `plots_report.md`.
+- The report embeds all generated plot images and includes a short description per plot.
+
+## Notes on Even Spacing
+
+The pipeline uses evenly spaced grids for sampled x-axis points in threshold and uncertainty-cutoff performance plots, and for the sigma schedule used by noise sensitivity plots.

+ 153 - 0
analysis/README.md

@@ -0,0 +1,153 @@
+# Model Evaluation Code
+
+## Description
+
+This folder should contain all the necessary code for the evaluation of the Bayeisan model and the Deep Ensemble. It should generate and save graphs and statistics for this. The included analyses should include
+
+- Performance information (i.e. basic information on accuracy, number correct, number incorrect, F1 score, etc.)
+- Some basic metrics for uncertainty (MCE, Brier)
+- Physican Confidence Analysis (graphing uncertainty vs physican confidence)
+- Longitudinal Analysis (graphing uncertainty on patients who remained stable CN, stable AD, or switched from CN to AD)
+- Noise introduction analysis (graphing uncertainty on normal image and images with increasing levels of Gaussian noise applied)
+
+"Uncertainty" should be taken to mean EITHER confidence or standard deviation for the ensemble and the bayesian network (i.e. the raw outputs distance from 0.5 or either the stdev from the Bayesian or the stdev of all the model outputs). Analyses should be evaluated with both.
+
+The code in the senior_research_thesis and the prebious alnn_rewrite/analysis folders should be consulted. The models should be loaded from the currently in-place config.toml
+
+## Implementation Status
+
+The modular implementation now lives entirely in this folder.
+
+- All new source code is under `alnn_rewrite/analysis`
+- All generated artifacts are written under `alnn_rewrite/analysis_output`
+
+If a backend does not already have `model_evaluation_results.nc`, the pipeline now automatically evaluates that backend on the holdout datasets (validation + test) first, saves the generated netCDF into that backend's model output directory, and then runs the analyses.
+
+Uncertainty analyses are now run using both:
+
+- Confidence
+- Standard deviation (ensemble)
+- Predictive uncertainty (bayesian; computed via `bayesian_torch.utils.util.predictive_entropy`)
+
+### Current Modules
+
+- `evaluate_models.py`: Orchestrator CLI for running selected analyses across ensemble and bayesian backends.
+- `defaults.py`: Centralized analysis defaults (threshold grid, noise factors, calibration bins, class index, decision threshold, bayesian MC passes).
+- `data_pipeline.py`: Shared analysis data loading and split/loader construction utilities.
+- `plotting.py`: Shared plotting functions for all analysis visualizations.
+- `uncertainty.py`: Shared confidence certainty/uncertainty transforms.
+- `model_utils.py`: Shared model behavior utilities (including bayesian sampling mode setup).
+- `runtime.py`: Runtime paths and JSON helpers.
+- `data_access.py`: netCDF loading, class probability extraction, and clinical table access.
+- `metrics.py`: Shared performance and calibration metrics.
+- `analysis_modules.py`: Performance, calibration, physician confidence, and longitudinal analyses.
+- `noise_analysis.py`: Evaluation-time Gaussian noise sensitivity analysis.
+
+### How To Run
+
+From `alnn_rewrite`:
+
+```bash
+python -m analysis.evaluate_models
+```
+
+Useful options:
+
+```bash
+python -m analysis.evaluate_models \
+	--backend ensemble bayesian \
+	--run-name first_modular_run
+```
+
+If you want to skip noise analysis while validating the pipeline:
+
+```bash
+python -m analysis.evaluate_models --skip-noise
+```
+
+If you want to quickly investigate longitudinal cohort breakdown only (without rerunning all analyses):
+
+```bash
+python -m analysis.evaluate_models --longitudinal-breakdown-only
+```
+
+If you want to rerun only the noise uncertainty-vs-accuracy regression/correlation analysis from existing noise CSV outputs:
+
+```bash
+python -m analysis.evaluate_models --noise-correlation-only --run-name run_YYYYMMDD_HHMMSS
+```
+
+If you want to generate only dataset composition documentation (total images, class counts, train/validation/test counts, and percentage breakdowns) without rerunning analyses:
+
+```bash
+python -m analysis.evaluate_models --dataset-summary-only
+```
+
+This writes `dataset_summary.md` into the selected run directory under `analysis_output`.
+
+Most analysis tuning parameters are now intentionally centralized in `analysis/defaults.py` to reduce CLI verbosity and improve maintainability.
+
+### Output Layout
+
+Each run creates a dedicated directory:
+
+```text
+alnn_rewrite/analysis_output/
+	run_YYYYMMDD_HHMMSS/
+		run_manifest.json
+		ensemble/
+			backend_summary.json
+			plots_report.md
+			performance_threshold_sweep.csv
+			calibration_bins.csv
+			performance_uncertainty_cutoff.csv
+			performance_uncertainty_percentile_cutoff.csv
+			physician_grouped_metrics.csv
+			physician_confidence_grouped_metrics.csv
+			physician_std_grouped_metrics.csv
+			longitudinal_patient_summary.csv
+			longitudinal_cohort_summary.csv
+			longitudinal_uncertainty_by_cohort.csv
+			longitudinal_confidence_patient_summary.csv
+			longitudinal_std_patient_summary.csv
+			longitudinal_confidence_cohort_summary.csv
+			longitudinal_std_cohort_summary.csv
+			noise_sensitivity.csv
+			noise_accuracy_uncertainty_stats.csv
+			noise_accuracy_uncertainty_summary.md
+			plots/
+				performance_threshold_accuracy.png
+				performance_threshold_f1.png
+				calibration_reliability.png
+				performance_uncertainty_cutoff_accuracy.png
+				performance_uncertainty_cutoff_f1.png
+				performance_uncertainty_percentile_cutoff_accuracy.png
+				performance_uncertainty_percentile_cutoff_f1.png
+				physician_confidence_boxplot.png
+				physician_std_boxplot.png
+				longitudinal_cohort_confidence.png
+				longitudinal_cohort_std.png
+				noise_sensitivity_accuracy.png
+				noise_sensitivity_f1.png
+				noise_confidence.png
+				noise_standard_deviation.png (ensemble) / noise_predictive_uncertainty.png (bayesian)
+				noise_accuracy_uncertainty_2d.png
+				noise_examples/
+					ensemble_noise_examples.png
+					bayesian_noise_examples.png
+					ensemble_clean_scan_example.png
+					bayesian_clean_scan_example.png
+		bayesian/
+			... (same structure)
+```
+
+The default noise schedule now includes noisier settings beyond the earlier small-sigma cases, so the saved example images show a clearer progression from lightly noised to heavily noised volumes.
+
+Performance analysis now includes two uncertainty cutoff views:
+
+- A raw-value cutoff sweep that keeps samples with uncertainty below the percentile-derived cutoff value.
+- A percentile-ranked cutoff sweep that is plotted from least restricted (left, all samples) to most restricted (right, lowest-uncertainty subset).
+
+Noise analysis now saves individual series plots (accuracy, F1, confidence, and standard deviation or predictive uncertainty) so each metric is shown on its own graph.
+
+Noise plots now label the x-axis as Gaussian Noise Factor to reflect that values are multipliers on the actual noise sigma.

+ 1 - 0
analysis/__init__.py

@@ -0,0 +1 @@
+"""Modular analysis package for model evaluation."""

+ 691 - 0
analysis/analysis_modules.py

@@ -0,0 +1,691 @@
+# pyright: basic
+
+from __future__ import annotations
+
+from pathlib import Path
+import re
+from typing import Any
+
+import numpy as np
+import pandas as pd
+
+from .data_access import BackendEvaluation, physician_column
+from .defaults import DEFAULT_DECISION_THRESHOLD, uncertainty_cutoff_percentiles
+from .metrics import calibration_stats, performance_at_threshold, threshold_sweep
+from .plotting import (
+    plots_dir,
+    save_boxplot,
+    save_calibration_plot,
+    save_performance_threshold_plot,
+    save_performance_threshold_pair_plot,
+    save_uncertainty_cutoff_plot,
+    save_uncertainty_cutoff_pair_plot,
+)
+from .runtime import write_json
+
+
+def _save_table(rows: list[dict[str, Any]], out_path: Path) -> pd.DataFrame:
+    df = pd.DataFrame(rows)
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    df.to_csv(out_path, index=False)
+    return df
+
+
+def _uncertainty_percentiles(values: np.ndarray) -> np.ndarray:
+    vals = np.asarray(values, dtype=float)
+    n = len(vals)
+    if n == 0:
+        return np.asarray([], dtype=float)
+    if n == 1:
+        return np.asarray([0.0], dtype=float)
+
+    order = np.argsort(vals)
+    ranks = np.empty(n, dtype=float)
+    ranks[order] = np.arange(0, n, dtype=float)
+    return (ranks / float(n - 1)) * 100.0
+
+
+def _save_ensemble_prediction_debug(
+    evaluation: BackendEvaluation,
+    output_dir: Path,
+) -> str:
+    uncertainty_vals = np.asarray(evaluation.uncertainty_confidence, dtype=float)
+    uncertainty_pct = _uncertainty_percentiles(uncertainty_vals)
+    pred_label = (evaluation.y_prob >= DEFAULT_DECISION_THRESHOLD).astype(int)
+    true_label = np.asarray(evaluation.y_true, dtype=int)
+    is_correct = (pred_label == true_label).astype(int)
+
+    debug_df = pd.DataFrame(
+        {
+            "image_id": evaluation.image_ids.astype(int),
+            "predicted_probability": np.asarray(evaluation.y_prob, dtype=float),
+            "predicted_label": pred_label,
+            "actual_label": true_label,
+            "is_correct": is_correct,
+            "confidence": uncertainty_vals,
+            "confidence_percentile": uncertainty_pct,
+        }
+    ).sort_values("confidence", ascending=False)
+
+    path = output_dir / "ensemble_prediction_debug.csv"
+    debug_df.to_csv(path, index=False)
+    return str(path)
+
+
+def _uncertainty_cutoff_analysis(
+    evaluation: BackendEvaluation,
+    output_dir: Path,
+    uncertainty_types: list[tuple[str, np.ndarray]],
+    cutoff_percentiles: np.ndarray,
+    table_filename: str,
+    plot_filename_prefix: str,
+    title_prefix: str,
+    x_label: str,
+) -> dict[str, Any]:
+    def _slug(name: str) -> str:
+        s = re.sub(r"[^a-z0-9]+", "_", name.strip().lower())
+        return s.strip("_") or "metric"
+
+    def _confidence_like(name: str) -> bool:
+        return "confidence" in name.strip().lower()
+
+    cutoff_rows: list[dict[str, Any]] = []
+
+    for uncertainty_name, values in uncertainty_types:
+        finite_mask = np.isfinite(values)
+        if not finite_mask.any():
+            continue
+
+        values_valid = values[finite_mask]
+        y_true_valid = evaluation.y_true[finite_mask]
+        y_prob_valid = evaluation.y_prob[finite_mask]
+        keep_higher = _confidence_like(uncertainty_name)
+        selection_label = (
+            "metric >= percentile cutoff (higher is lower uncertainty)"
+            if keep_higher
+            else "metric <= percentile cutoff (lower is lower uncertainty)"
+        )
+
+        for cutoff_percentile in cutoff_percentiles:
+            cutoff_value = float(np.percentile(values_valid, cutoff_percentile))
+            if keep_higher:
+                keep_mask = values_valid >= cutoff_value
+            else:
+                keep_mask = values_valid <= cutoff_value
+
+            retained = int(keep_mask.sum())
+            if retained == 0:
+                continue
+
+            perf = performance_at_threshold(
+                y_true=y_true_valid[keep_mask],
+                y_prob=y_prob_valid[keep_mask],
+                threshold=DEFAULT_DECISION_THRESHOLD,
+            )
+            cutoff_rows.append(
+                {
+                    "uncertainty_type": uncertainty_name,
+                    "cutoff_percentile": float(cutoff_percentile),
+                    "cutoff_value": cutoff_value,
+                    "n_retained": retained,
+                    "coverage": float(retained / len(values_valid)),
+                    "selection_rule": selection_label,
+                    "keep_higher": bool(keep_higher),
+                    "accuracy": float(perf["accuracy"]),
+                    "f1": float(perf["f1"]),
+                    "n_correct": int(perf["tp"] + perf["tn"]),
+                    "n_incorrect": int(perf["fp"] + perf["fn"]),
+                }
+            )
+
+    table_path = output_dir / table_filename
+    accuracy_plot_path = plots_dir(output_dir) / f"{plot_filename_prefix}_accuracy.png"
+    f1_plot_path = plots_dir(output_dir) / f"{plot_filename_prefix}_f1.png"
+    pair_plot_path = plots_dir(output_dir) / f"{plot_filename_prefix}_accuracy_f1.png"
+    if not cutoff_rows:
+        return {
+            "table": str(table_path),
+            "accuracy_plot": str(accuracy_plot_path),
+            "f1_plot": str(f1_plot_path),
+            "pair_plot": str(pair_plot_path),
+            "rows": 0,
+        }
+
+    cutoff_df = pd.DataFrame(cutoff_rows)
+    cutoff_df.to_csv(table_path, index=False)
+
+    # Restriction level increases from left to right so right-most points are most restricted.
+    cutoff_df["restriction_level"] = np.where(
+        cutoff_df["keep_higher"].astype(bool),
+        cutoff_df["cutoff_percentile"],
+        100.0 - cutoff_df["cutoff_percentile"],
+    )
+
+    plots_by_uncertainty: dict[str, dict[str, str]] = {}
+    for uncertainty_name in sorted(pd.unique(cutoff_df["uncertainty_type"])):
+        sub_df = cutoff_df[cutoff_df["uncertainty_type"] == uncertainty_name].copy()
+        slug = _slug(str(uncertainty_name))
+        sub_accuracy_plot_path = (
+            plots_dir(output_dir) / f"{plot_filename_prefix}_{slug}_accuracy.png"
+        )
+        sub_f1_plot_path = (
+            plots_dir(output_dir) / f"{plot_filename_prefix}_{slug}_f1.png"
+        )
+        sub_pair_plot_path = (
+            plots_dir(output_dir) / f"{plot_filename_prefix}_{slug}_accuracy_f1.png"
+        )
+
+        save_uncertainty_cutoff_plot(
+            cutoff_df=sub_df,
+            title_prefix=f"{title_prefix} ({uncertainty_name})",
+            x_label=x_label,
+            output_path=sub_accuracy_plot_path,
+            metric_column="accuracy",
+            metric_label="Accuracy",
+            plot_key=f"{plot_filename_prefix}_{slug}_accuracy",
+        )
+        save_uncertainty_cutoff_plot(
+            cutoff_df=sub_df,
+            title_prefix=f"{title_prefix} ({uncertainty_name})",
+            x_label=x_label,
+            output_path=sub_f1_plot_path,
+            metric_column="f1",
+            metric_label="F1",
+            plot_key=f"{plot_filename_prefix}_{slug}_f1",
+        )
+        save_uncertainty_cutoff_pair_plot(
+            cutoff_df=sub_df,
+            title_prefix=f"{title_prefix} ({uncertainty_name})",
+            x_label=x_label,
+            output_path=sub_pair_plot_path,
+            plot_key=f"{plot_filename_prefix}_{slug}_accuracy_f1",
+        )
+
+        plots_by_uncertainty[str(uncertainty_name)] = {
+            "accuracy": str(sub_accuracy_plot_path),
+            "f1": str(sub_f1_plot_path),
+            "accuracy_f1": str(sub_pair_plot_path),
+        }
+
+    return {
+        "table": str(table_path),
+        "plots_by_uncertainty": plots_by_uncertainty,
+        "rows": int(len(cutoff_df)),
+    }
+
+
+def run_performance(
+    evaluation: BackendEvaluation,
+    output_dir: Path,
+    thresholds: np.ndarray,
+) -> dict[str, Any]:
+    rows = threshold_sweep(evaluation.y_true, evaluation.y_prob, thresholds)
+    table_path = output_dir / "performance_threshold_sweep.csv"
+    df = _save_table(rows, table_path)
+
+    accuracy_plot_path = plots_dir(output_dir) / "performance_threshold_accuracy.png"
+    f1_plot_path = plots_dir(output_dir) / "performance_threshold_f1.png"
+    pair_plot_path = plots_dir(output_dir) / "performance_threshold_accuracy_f1.png"
+    save_performance_threshold_plot(
+        df=df,
+        backend=evaluation.backend,
+        output_path=accuracy_plot_path,
+        metric_column="accuracy",
+        metric_label="Accuracy",
+        plot_key="performance_threshold_accuracy",
+    )
+    save_performance_threshold_plot(
+        df=df,
+        backend=evaluation.backend,
+        output_path=f1_plot_path,
+        metric_column="f1",
+        metric_label="F1",
+        plot_key="performance_threshold_f1",
+    )
+    save_performance_threshold_pair_plot(
+        df=df,
+        backend=evaluation.backend,
+        output_path=pair_plot_path,
+        plot_key="performance_threshold_accuracy_f1",
+    )
+
+    best_idx = int(df["f1"].idxmax())
+    best = df.iloc[best_idx].to_dict()
+
+    cutoff_percentiles = uncertainty_cutoff_percentiles()
+    model_output_vals = np.asarray(evaluation.uncertainty_confidence, dtype=float)
+    secondary_uncertainty = np.asarray(evaluation.uncertainty_std, dtype=float)
+    secondary_uncertainty_name = (
+        "predictive uncertainty"
+        if evaluation.uncertainty_metric == "predictive_entropy"
+        else "standard deviation"
+    )
+    uncertainty_types = [
+        ("confidence", model_output_vals),
+        (secondary_uncertainty_name, secondary_uncertainty),
+    ]
+
+    uncertainty_cutoff = _uncertainty_cutoff_analysis(
+        evaluation=evaluation,
+        output_dir=output_dir,
+        uncertainty_types=uncertainty_types,
+        cutoff_percentiles=cutoff_percentiles,
+        table_filename="performance_uncertainty_cutoff.csv",
+        plot_filename_prefix="performance_uncertainty_cutoff",
+        title_prefix="Model Output / Uncertainty Cutoff Percentile",
+        x_label="Restriction Level (0 = all samples, 100 = most restricted subset)",
+    )
+
+    percentile_cutoffs = uncertainty_cutoff_percentiles()
+    uncertainty_percentile_cutoff = _uncertainty_cutoff_analysis(
+        evaluation=evaluation,
+        output_dir=output_dir,
+        uncertainty_types=uncertainty_types,
+        cutoff_percentiles=percentile_cutoffs,
+        table_filename="performance_uncertainty_percentile_cutoff.csv",
+        plot_filename_prefix="performance_uncertainty_percentile_cutoff",
+        title_prefix="Model Output / Uncertainty Percentile Floor",
+        x_label="Percentile Floor (0 = all samples, 100 = top percentile subset)",
+    )
+
+    cutoff_table_path = Path(uncertainty_cutoff["table"])
+    percentile_cutoff_table_path = Path(uncertainty_percentile_cutoff["table"])
+    summary = {
+        "best_by_f1": {
+            k: float(v) for k, v in best.items() if isinstance(v, (int, float))
+        },
+        "table": str(table_path),
+        "plots": {
+            "accuracy": str(accuracy_plot_path),
+            "f1": str(f1_plot_path),
+            "accuracy_f1": str(pair_plot_path),
+        },
+        "uncertainty_cutoff": {
+            "table": str(cutoff_table_path),
+            "plots_by_uncertainty": uncertainty_cutoff["plots_by_uncertainty"],
+            "decision_threshold": DEFAULT_DECISION_THRESHOLD,
+        },
+        "uncertainty_percentile_cutoff": {
+            "table": str(percentile_cutoff_table_path),
+            "plots_by_uncertainty": uncertainty_percentile_cutoff[
+                "plots_by_uncertainty"
+            ],
+            "decision_threshold": DEFAULT_DECISION_THRESHOLD,
+        },
+    }
+    if evaluation.backend == "ensemble":
+        summary["ensemble_prediction_debug"] = _save_ensemble_prediction_debug(
+            evaluation=evaluation,
+            output_dir=output_dir,
+        )
+    write_json(output_dir / "performance_summary.json", summary)
+    return summary
+
+
+def run_calibration(
+    evaluation: BackendEvaluation,
+    output_dir: Path,
+    bins: int,
+) -> dict[str, Any]:
+    summary, per_bin = calibration_stats(
+        evaluation.y_true, evaluation.y_prob, bins=bins
+    )
+
+    bin_df = pd.DataFrame(
+        per_bin,
+        columns=["mean_confidence", "fraction_positive", "count"],
+    )
+    table_path = output_dir / "calibration_bins.csv"
+    bin_df.to_csv(table_path, index=False)
+
+    plot_path = plots_dir(output_dir) / "calibration_reliability.png"
+    save_calibration_plot(
+        per_bin=per_bin,
+        backend=evaluation.backend,
+        output_path=plot_path,
+    )
+
+    out = {
+        **summary,
+        "table": str(table_path),
+        "plot": str(plot_path),
+    }
+    write_json(output_dir / "calibration_summary.json", out)
+    return out
+
+
+def run_physician(
+    evaluation: BackendEvaluation,
+    clinical_df: pd.DataFrame,
+    output_dir: Path,
+) -> dict[str, Any]:
+    secondary_key = (
+        "predictive_entropy"
+        if evaluation.uncertainty_metric == "predictive_entropy"
+        else "std"
+    )
+    secondary_label = (
+        "Predictive Uncertainty"
+        if secondary_key == "predictive_entropy"
+        else "Standard Deviation"
+    )
+
+    col = physician_column(clinical_df)
+    subset = clinical_df[["Image Data ID", col]].copy()
+    subset[col] = pd.to_numeric(subset[col], errors="coerce")
+    subset = subset.dropna(subset=["Image Data ID", col])
+    subset["Image Data ID"] = subset["Image Data ID"].astype(int)
+    subset[col] = subset[col].astype(int)
+
+    eval_df = pd.DataFrame(
+        {
+            "Image Data ID": evaluation.image_ids.astype(int),
+            "model_probability": evaluation.uncertainty_confidence,
+            "model_std": evaluation.uncertainty_std,
+            "model_prob": evaluation.y_prob,
+        }
+    )
+    merged = eval_df.merge(subset, on="Image Data ID", how="inner")
+
+    if merged.empty:
+        raise ValueError("No overlapping Image Data ID rows for physician analysis")
+
+    grouped_rows: list[dict[str, Any]] = []
+    uncertainty_specs = [
+        ("confidence", "model_probability", "Confidence"),
+        (secondary_key, "model_std", secondary_label),
+    ]
+    ratings = [int(r) for r in sorted(pd.unique(merged[col]))]
+    plot_paths: dict[str, str] = {}
+    correlations: dict[str, float] = {}
+
+    for metric_name, metric_col, metric_label in uncertainty_specs:
+        grouped_metric = (
+            merged.groupby(col)
+            .agg(
+                n=("Image Data ID", "count"),
+                mean_value=(metric_col, "mean"),
+                std_value=(metric_col, "std"),
+                mean_prob=("model_prob", "mean"),
+            )
+            .reset_index()
+            .rename(columns={col: "physician_rating"})
+        )
+        grouped_metric["uncertainty_type"] = metric_name
+        grouped_rows.extend(
+            [
+                {str(k): v for k, v in rec.items()}
+                for rec in grouped_metric.to_dict(orient="records")
+            ]
+        )
+
+        data = [
+            np.asarray(merged.loc[merged[col] == r, metric_col], dtype=float)
+            for r in ratings
+        ]
+        plot_path = plots_dir(output_dir) / f"physician_{metric_name}_boxplot.png"
+        save_boxplot(
+            data=data,
+            tick_labels=[str(r) for r in ratings],
+            x_label="Physician Confidence Rating (DXCONFID)",
+            y_label=metric_label,
+            title=f"{metric_label} by Physician Confidence Rating ({evaluation.backend})",
+            output_path=plot_path,
+        )
+
+        corr = float(
+            pd.to_numeric(
+                merged[[metric_col, col]].corr(method="spearman").iloc[0, 1],
+                errors="coerce",
+            )
+        )
+        correlations[metric_name] = corr
+        plot_paths[metric_name] = str(plot_path)
+
+    grouped = pd.DataFrame(grouped_rows)
+    table_path = output_dir / "physician_grouped_metrics.csv"
+    grouped.to_csv(table_path, index=False)
+
+    confidence_table = output_dir / "physician_confidence_grouped_metrics.csv"
+    std_table = output_dir / "physician_std_grouped_metrics.csv"
+    secondary_table = output_dir / f"physician_{secondary_key}_grouped_metrics.csv"
+    grouped[grouped["uncertainty_type"] == "confidence"].to_csv(
+        confidence_table, index=False
+    )
+    grouped[grouped["uncertainty_type"] == secondary_key].to_csv(
+        secondary_table, index=False
+    )
+    grouped[grouped["uncertainty_type"] == secondary_key].to_csv(std_table, index=False)
+
+    out = {
+        "n_overlap": int(len(merged)),
+        "spearman_vs_dxconfid": correlations,
+        "table": str(table_path),
+        "tables": {
+            "confidence": str(confidence_table),
+            secondary_key: str(secondary_table),
+            "std": str(std_table),
+        },
+        "plots": plot_paths,
+    }
+    write_json(output_dir / "physician_summary.json", out)
+    return out
+
+
+def _normalize_dx(value: Any) -> str:
+    if value is None or (isinstance(value, float) and np.isnan(value)):
+        return ""
+
+    v = str(value).strip().upper()
+    if v in {"NL", "NORMAL"}:
+        return "CN"
+    return v
+
+
+def run_longitudinal(
+    evaluation: BackendEvaluation,
+    clinical_df: pd.DataFrame,
+    output_dir: Path,
+) -> dict[str, Any]:
+    secondary_key = (
+        "predictive_entropy"
+        if evaluation.uncertainty_metric == "predictive_entropy"
+        else "std"
+    )
+    secondary_label = (
+        "Mean Predictive Uncertainty"
+        if secondary_key == "predictive_entropy"
+        else "Mean Standard Deviation"
+    )
+
+    required = ["Image Data ID", "PTID"]
+    missing = [c for c in required if c not in clinical_df.columns]
+    if missing:
+        raise KeyError(f"Missing columns for longitudinal analysis: {missing}")
+
+    diagnosis_col = None
+    for candidate in ["Class", "DX", "Diagnosis"]:
+        if candidate in clinical_df.columns:
+            diagnosis_col = candidate
+            break
+    if diagnosis_col is None:
+        raise KeyError(
+            "No diagnosis column found. Expected one of: Class, DX, Diagnosis"
+        )
+
+    work = clinical_df[
+        ["Image Data ID", "PTID", diagnosis_col]
+        + [c for c in ["EXAMDATE"] if c in clinical_df.columns]
+    ].copy()
+    work["Image Data ID"] = pd.to_numeric(work["Image Data ID"], errors="coerce")
+    work = work.dropna(subset=["Image Data ID", "PTID"])
+    work["Image Data ID"] = work["Image Data ID"].astype(int)
+    work["PTID"] = work["PTID"].astype(str).str.strip()
+    work["diagnosis"] = work[diagnosis_col].map(_normalize_dx)
+
+    if "EXAMDATE" in work.columns:
+        work["EXAMDATE"] = pd.to_datetime(work["EXAMDATE"], errors="coerce")
+        work = work.sort_values(["PTID", "EXAMDATE"], na_position="last")
+    else:
+        work = work.sort_values(["PTID", "Image Data ID"])
+
+    eval_df = pd.DataFrame(
+        {
+            "Image Data ID": evaluation.image_ids.astype(int),
+            "model_confidence": evaluation.uncertainty_confidence,
+            "model_std": evaluation.uncertainty_std,
+            "model_prob": evaluation.y_prob,
+        }
+    )
+    merged = work.merge(eval_df, on="Image Data ID", how="inner")
+
+    if merged.empty:
+        raise ValueError("No overlapping Image Data ID rows for longitudinal analysis")
+
+    patient_rows: list[dict[str, Any]] = []
+    for ptid, group in merged.groupby("PTID"):
+        diagnoses = [d for d in group["diagnosis"].tolist() if d]
+        if len(diagnoses) < 2:
+            continue
+
+        first_dx = diagnoses[0]
+        last_dx = diagnoses[-1]
+        unique_dx = set(diagnoses)
+
+        cohort = "other"
+        if unique_dx.issubset({"CN"}):
+            cohort = "stable_cn"
+        elif unique_dx.issubset({"AD"}):
+            cohort = "stable_ad"
+        elif first_dx == "CN" and last_dx == "AD" and unique_dx.issubset({"CN", "AD"}):
+            cohort = "cn_to_ad"
+
+        patient_rows.append(
+            {
+                "PTID": ptid,
+                "n_visits": int(len(group)),
+                "first_dx": first_dx,
+                "last_dx": last_dx,
+                "cohort": cohort,
+                "mean_confidence": float(group["model_confidence"].mean()),
+                "mean_std": float(group["model_std"].mean()),
+                "mean_prob": float(group["model_prob"].mean()),
+            }
+        )
+
+    patient_df = pd.DataFrame(patient_rows)
+    patient_df = patient_df[
+        patient_df["cohort"].isin(["stable_cn", "stable_ad", "cn_to_ad"])
+    ].copy()
+
+    table_path = output_dir / "longitudinal_patient_summary.csv"
+    patient_df.to_csv(table_path, index=False)
+
+    cohort_df = (
+        patient_df.groupby("cohort")
+        .agg(
+            n_patients=("PTID", "count"),
+            mean_confidence=("mean_confidence", "mean"),
+            mean_std=("mean_std", "mean"),
+            mean_prob=("mean_prob", "mean"),
+        )
+        .reset_index()
+    )
+    cohort_table = output_dir / "longitudinal_cohort_summary.csv"
+    cohort_df.to_csv(cohort_table, index=False)
+
+    cohorts = ["stable_cn", "stable_ad", "cn_to_ad"]
+    uncertainty_specs = [
+        ("confidence", "mean_confidence", "Mean Confidence"),
+        (secondary_key, "mean_std", secondary_label),
+    ]
+    plot_paths: dict[str, str] = {}
+    for metric_name, metric_col, metric_label in uncertainty_specs:
+        values = [
+            np.asarray(
+                patient_df.loc[patient_df["cohort"] == c, metric_col], dtype=float
+            )
+            for c in cohorts
+        ]
+        plot_path = plots_dir(output_dir) / f"longitudinal_cohort_{metric_name}.png"
+        save_boxplot(
+            data=values,
+            tick_labels=cohorts,
+            x_label="Longitudinal Cohort",
+            y_label=metric_label,
+            title=f"Longitudinal Cohort Comparison: {metric_label} ({evaluation.backend})",
+            output_path=plot_path,
+        )
+        plot_paths[metric_name] = str(plot_path)
+
+    uncertainty_by_cohort = cohort_df.melt(
+        id_vars=["cohort", "n_patients"],
+        value_vars=["mean_confidence", "mean_std"],
+        var_name="uncertainty_type",
+        value_name="mean_value",
+    ).replace(
+        {
+            "uncertainty_type": {
+                "mean_confidence": "confidence",
+                "mean_std": secondary_key,
+            }
+        }
+    )
+    uncertainty_table = output_dir / "longitudinal_uncertainty_by_cohort.csv"
+    uncertainty_by_cohort.to_csv(uncertainty_table, index=False)
+
+    confidence_patient_table = (
+        output_dir / "longitudinal_confidence_patient_summary.csv"
+    )
+    std_patient_table = output_dir / "longitudinal_std_patient_summary.csv"
+    confidence_cohort_table = output_dir / "longitudinal_confidence_cohort_summary.csv"
+    std_cohort_table = output_dir / "longitudinal_std_cohort_summary.csv"
+    secondary_patient_table = (
+        output_dir / f"longitudinal_{secondary_key}_patient_summary.csv"
+    )
+    secondary_cohort_table = (
+        output_dir / f"longitudinal_{secondary_key}_cohort_summary.csv"
+    )
+
+    patient_df[
+        ["PTID", "n_visits", "first_dx", "last_dx", "cohort", "mean_confidence"]
+    ].to_csv(confidence_patient_table, index=False)
+    patient_df[
+        ["PTID", "n_visits", "first_dx", "last_dx", "cohort", "mean_std"]
+    ].to_csv(std_patient_table, index=False)
+    patient_df[
+        ["PTID", "n_visits", "first_dx", "last_dx", "cohort", "mean_std"]
+    ].to_csv(secondary_patient_table, index=False)
+    cohort_df[["cohort", "n_patients", "mean_confidence"]].to_csv(
+        confidence_cohort_table, index=False
+    )
+    cohort_df[["cohort", "n_patients", "mean_std"]].to_csv(
+        std_cohort_table, index=False
+    )
+    cohort_df[["cohort", "n_patients", "mean_std"]].to_csv(
+        secondary_cohort_table, index=False
+    )
+
+    out = {
+        "n_patients_analyzed": int(len(patient_df)),
+        "table_patient": str(table_path),
+        "table_cohort": str(cohort_table),
+        "table_uncertainty": str(uncertainty_table),
+        "tables": {
+            "confidence": {
+                "patient": str(confidence_patient_table),
+                "cohort": str(confidence_cohort_table),
+            },
+            secondary_key: {
+                "patient": str(secondary_patient_table),
+                "cohort": str(secondary_cohort_table),
+            },
+            "std": {
+                "patient": str(std_patient_table),
+                "cohort": str(std_cohort_table),
+            },
+        },
+        "plots": plot_paths,
+    }
+    write_json(output_dir / "longitudinal_summary.json", out)
+    return out

+ 175 - 0
analysis/data_access.py

@@ -0,0 +1,175 @@
+# pyright: basic
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pandas as pd
+import xarray as xr
+from bayesian_torch.utils.util import predictive_entropy
+
+
+@dataclass
+class BackendEvaluation:
+    backend: str
+    source_file: Path
+    image_ids: np.ndarray
+    y_true: np.ndarray
+    y_prob: np.ndarray
+    uncertainty_confidence: np.ndarray
+    uncertainty_std: np.ndarray
+    uncertainty_metric: str
+
+
+def _resolve_dataset_path(model_output_dir: Path) -> Path:
+    primary = model_output_dir / "model_evaluation_results.nc"
+    if primary.exists():
+        return primary
+
+    candidates = sorted(model_output_dir.glob("*.nc"))
+    if not candidates:
+        raise FileNotFoundError(f"No netCDF file found under {model_output_dir}")
+
+    return candidates[0]
+
+
+def _positive_probability(
+    predictions: xr.DataArray,
+    class_index: int,
+) -> tuple[np.ndarray, np.ndarray, np.ndarray, str]:
+    if "img_class" not in predictions.dims:
+        raise ValueError("predictions is missing required dim: img_class")
+
+    if class_index >= predictions.sizes["img_class"]:
+        raise ValueError(
+            f"positive class index {class_index} is out of bounds for img_class size {predictions.sizes['img_class']}"
+        )
+
+    if "model" in predictions.dims:
+        class_probs = predictions.isel(img_class=class_index)
+        prob_mean = class_probs.mean(dim="model").values
+        # Confidence is the direct model output probability for the predicted class.
+        prob_std = class_probs.std(dim="model").values
+        return (
+            np.asarray(prob_mean, dtype=float),
+            np.asarray(prob_mean, dtype=float),
+            np.asarray(prob_std, dtype=float),
+            "std",
+        )
+
+    sample_like = [d for d in predictions.dims if d in {"sample", "mc_sample", "draw"}]
+    if sample_like:
+        dim = str(sample_like[0])
+        class_probs = predictions.isel(img_class=class_index)
+        prob_mean = class_probs.mean(dim=dim).values
+
+        # For Bayesian MC predictions, uncertainty should come from predictive
+        # entropy of the predictive distribution rather than classwise std.
+        mc_preds = predictions.transpose(dim, "img_id", "img_class").values
+        entropy_uncertainty = predictive_entropy(np.asarray(mc_preds, dtype=float))
+        return (
+            np.asarray(prob_mean, dtype=float),
+            np.asarray(prob_mean, dtype=float),
+            np.asarray(entropy_uncertainty, dtype=float),
+            "predictive_entropy",
+        )
+
+    prob = predictions.isel(img_class=class_index).values
+    return (
+        np.asarray(prob, dtype=float),
+        np.asarray(prob, dtype=float),
+        np.full_like(np.asarray(prob, dtype=float), np.nan),
+        "unknown",
+    )
+
+
+def _labels_to_binary(labels: xr.DataArray, class_index: int) -> np.ndarray:
+    if "label" in labels.dims:
+        if class_index >= labels.sizes["label"]:
+            raise ValueError(
+                f"positive class index {class_index} is out of bounds for label size {labels.sizes['label']}"
+            )
+
+        # One-hot labels expected in this repository.
+        binary = labels.argmax(dim="label").values == class_index
+        return np.asarray(binary, dtype=int)
+
+    # Fallback if labels are already binary.
+    return np.asarray(labels.values, dtype=int)
+
+
+def load_backend_evaluation(
+    config: dict[str, Any],
+    backend: str,
+    class_index: int,
+) -> BackendEvaluation:
+    output_key = f"{backend}_path"
+    if output_key not in config["output"]:
+        raise KeyError(f"Missing output path key in config: output.{output_key}")
+
+    model_output_dir = Path(config["output"][output_key]).expanduser().resolve()
+    ds_path = _resolve_dataset_path(model_output_dir)
+    ds = xr.open_dataset(ds_path)
+
+    if "predictions" not in ds or "labels" not in ds:
+        raise ValueError(
+            f"Dataset {ds_path} must contain predictions and labels variables"
+        )
+
+    predictions = ds["predictions"]
+    labels = ds["labels"]
+
+    if "img_id" in predictions.coords:
+        image_ids = np.asarray(predictions.coords["img_id"].values)
+    elif "img_id" in labels.coords:
+        image_ids = np.asarray(labels.coords["img_id"].values)
+    else:
+        length = predictions.sizes.get("img_id", labels.sizes.get("img_id"))
+        if length is None:
+            raise ValueError("Could not infer img_id length from predictions/labels")
+        image_ids = np.arange(length)
+
+    y_true = _labels_to_binary(labels, class_index=class_index)
+    y_prob, conf, y_std, uncertainty_metric = _positive_probability(
+        predictions, class_index=class_index
+    )
+
+    if len(y_true) != len(y_prob):
+        raise ValueError(
+            f"Length mismatch after loading backend {backend}: labels={len(y_true)}, probs={len(y_prob)}"
+        )
+
+    return BackendEvaluation(
+        backend=backend,
+        source_file=ds_path,
+        image_ids=image_ids,
+        y_true=y_true,
+        y_prob=y_prob,
+        uncertainty_confidence=conf,
+        uncertainty_std=y_std,
+        uncertainty_metric=uncertainty_metric,
+    )
+
+
+def load_clinical_table(config: dict[str, Any], root_dir: Path) -> pd.DataFrame:
+    csv_path = (root_dir / config["data"]["xls_file_path"]).resolve()
+    df = pd.read_csv(csv_path)
+    df.columns = df.columns.str.strip()
+    return df
+
+
+def physician_column(df: pd.DataFrame) -> str:
+    exact = "DXCONFID"
+    if exact in df.columns:
+        return exact
+
+    for col in df.columns:
+        if "dxconfid" in col.lower():
+            return col
+
+    raise KeyError(
+        "No physician confidence column with DXCONFID found in clinical table"
+    )

+ 88 - 0
analysis/data_pipeline.py

@@ -0,0 +1,88 @@
+# pyright: basic
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+from torch.utils.data import ConcatDataset, DataLoader, Subset
+
+from data.dataset import (
+    ADNIDataset,
+    divide_dataset_by_patient_id,
+    initalize_dataloaders,
+    load_adni_data_from_file,
+)
+
+
+def xls_preprocess(df: pd.DataFrame) -> pd.DataFrame:
+    data = df[["Image Data ID", "Sex", "Age (current)"]].copy()
+    data["Sex"] = data["Sex"].astype(str).str.strip()
+    data = data.replace({"M": 0, "F": 1})
+    return data
+
+
+def _patient_ids(xls_file: Path) -> list[tuple[int, str]]:
+    ptid_df = pd.read_csv(xls_file)
+    ptid_df.columns = ptid_df.columns.str.strip()
+    ptid_df = ptid_df[["Image Data ID", "PTID"]].dropna(
+        subset=["Image Data ID", "PTID"]
+    )
+    ptid_df["Image Data ID"] = ptid_df["Image Data ID"].astype(int)
+    ptid_df["PTID"] = ptid_df["PTID"].astype(str).str.strip()
+    ptid_df = ptid_df[ptid_df["PTID"] != ""]
+    return list(zip(ptid_df["Image Data ID"].tolist(), ptid_df["PTID"].tolist()))
+
+
+def build_dataset(config: dict[str, Any], root_dir: Path) -> tuple[ADNIDataset, Path]:
+    mri_files = (root_dir / config["data"]["mri_files_path"]).resolve().glob("*.nii")
+    xls_file = (root_dir / config["data"]["xls_file_path"]).resolve()
+
+    dataset = load_adni_data_from_file(
+        mri_files,
+        xls_file,
+        device=config["training"]["device"],
+        xls_preprocessor=xls_preprocess,
+    )
+    return dataset, xls_file
+
+
+def build_dataset_splits(
+    config: dict[str, Any],
+    dataset: ADNIDataset,
+    xls_file: Path,
+    seed: int,
+) -> list[Subset[ADNIDataset]]:
+    return divide_dataset_by_patient_id(
+        dataset,
+        _patient_ids(xls_file),
+        tuple(config["data"]["data_splits"]),
+        seed=seed,
+    )
+
+
+def build_dataset_and_test_loader(
+    config: dict[str, Any],
+    root_dir: Path,
+    seed: int,
+) -> tuple[ADNIDataset, DataLoader]:
+    dataset, xls_file = build_dataset(config, root_dir)
+    splits = build_dataset_splits(config, dataset, xls_file, seed=seed)
+    _, _, test_loader = initalize_dataloaders(
+        splits,
+        batch_size=int(config["training"]["batch_size"]),
+    )
+    return dataset, test_loader
+
+
+def build_holdout_loader(
+    config: dict[str, Any],
+    root_dir: Path,
+    seed: int,
+) -> DataLoader:
+    dataset, xls_file = build_dataset(config, root_dir)
+    splits = build_dataset_splits(config, dataset, xls_file, seed=seed)
+    _, val_loader, test_loader = initalize_dataloaders(splits, batch_size=1)
+    combined = ConcatDataset([val_loader.dataset, test_loader.dataset])
+    return DataLoader(combined, batch_size=1, shuffle=False)

+ 157 - 0
analysis/dataset_summary.py

@@ -0,0 +1,157 @@
+from __future__ import annotations
+
+from copy import deepcopy
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+from analysis.data_pipeline import build_dataset, build_dataset_splits
+
+
+def _percent(part: int, whole: int) -> float:
+    if whole <= 0:
+        return 0.0
+    return 100.0 * float(part) / float(whole)
+
+
+def compute_dataset_summary(
+    config: dict[str, Any],
+    root_dir: Path,
+    positive_class_index: int,
+) -> dict[str, Any]:
+    # Force CPU so summary generation works without GPU availability.
+    summary_config = deepcopy(config)
+    summary_config.setdefault("training", {})
+    summary_config["training"]["device"] = "cpu"
+
+    dataset, xls_file = build_dataset(summary_config, root_dir)
+    seed = int(config["data"]["seed"])
+    splits = build_dataset_splits(summary_config, dataset, xls_file, seed=seed)
+
+    split_names = ["train", "validation", "test"]
+    requested_ratios = [float(v) for v in config["data"]["data_splits"]]
+
+    if len(splits) != 3:
+        raise ValueError(f"Expected 3 dataset splits, got {len(splits)}.")
+
+    total_images = int(len(dataset))
+    labels = (dataset.expected_classes[:, positive_class_index] >= 0.5).int()
+
+    splits_summary: list[dict[str, Any]] = []
+    assigned = 0
+    assigned_positive = 0
+
+    for split_name, requested_ratio, subset in zip(
+        split_names,
+        requested_ratios,
+        splits,
+        strict=True,
+    ):
+        indices = list(subset.indices)
+        split_count = int(len(indices))
+        split_positive = int(labels[indices].sum().item()) if split_count > 0 else 0
+        split_negative = split_count - split_positive
+
+        assigned += split_count
+        assigned_positive += split_positive
+
+        splits_summary.append(
+            {
+                "split": split_name,
+                "requested_ratio": requested_ratio,
+                "image_count": split_count,
+                "image_pct_of_dataset": _percent(split_count, total_images),
+                "positive_count": split_positive,
+                "negative_count": split_negative,
+                "positive_pct_within_split": _percent(split_positive, split_count),
+                "negative_pct_within_split": _percent(split_negative, split_count),
+            }
+        )
+
+    if assigned != total_images:
+        raise ValueError(
+            f"Split coverage mismatch: assigned {assigned} images, expected {total_images}."
+        )
+
+    total_positive = assigned_positive
+    total_negative = total_images - total_positive
+
+    return {
+        "generated_utc": datetime.now(timezone.utc).isoformat(),
+        "seed": seed,
+        "positive_class_index": int(positive_class_index),
+        "totals": {
+            "image_count": total_images,
+            "positive_count": total_positive,
+            "negative_count": total_negative,
+            "positive_pct": _percent(total_positive, total_images),
+            "negative_pct": _percent(total_negative, total_images),
+        },
+        "splits": splits_summary,
+    }
+
+
+def _markdown_table(summary: dict[str, Any]) -> str:
+    lines = [
+        "| Split | Requested % | Images | Dataset % | Positive | Negative | Positive % (split) | Negative % (split) |",
+        "|---|---:|---:|---:|---:|---:|---:|---:|",
+    ]
+
+    for item in summary["splits"]:
+        lines.append(
+            "| "
+            f"{item['split'].title()} | "
+            f"{item['requested_ratio'] * 100.0:.2f}% | "
+            f"{item['image_count']} | "
+            f"{item['image_pct_of_dataset']:.2f}% | "
+            f"{item['positive_count']} | "
+            f"{item['negative_count']} | "
+            f"{item['positive_pct_within_split']:.2f}% | "
+            f"{item['negative_pct_within_split']:.2f}% |"
+        )
+
+    return "\n".join(lines)
+
+
+def write_dataset_summary_markdown(summary: dict[str, Any], path: Path) -> None:
+    totals = summary["totals"]
+    lines = [
+        "# Dataset Composition Summary",
+        "",
+        f"Generated (UTC): {summary['generated_utc']}",
+        f"Split seed: {summary['seed']}",
+        f"Positive class index: {summary['positive_class_index']}",
+        "",
+        "## Overall",
+        "",
+        f"- Total images: {totals['image_count']}",
+        f"- Positive images: {totals['positive_count']} ({totals['positive_pct']:.2f}%)",
+        f"- Negative images: {totals['negative_count']} ({totals['negative_pct']:.2f}%)",
+        "",
+        "## Train / Validation / Test Breakdown",
+        "",
+        _markdown_table(summary),
+        "",
+    ]
+    path.write_text("\n".join(lines), encoding="utf-8")
+
+
+def run_dataset_summary(
+    config: dict[str, Any],
+    root_dir: Path,
+    output_dir: Path,
+    positive_class_index: int,
+) -> dict[str, Any]:
+    summary = compute_dataset_summary(
+        config=config,
+        root_dir=root_dir,
+        positive_class_index=positive_class_index,
+    )
+
+    markdown_path = output_dir / "dataset_summary.md"
+    write_dataset_summary_markdown(summary, markdown_path)
+
+    return {
+        "summary_markdown": str(markdown_path),
+        "summary": summary,
+    }

+ 45 - 0
analysis/defaults.py

@@ -0,0 +1,45 @@
+# pyright: basic
+
+from __future__ import annotations
+
+import numpy as np
+
+DEFAULT_BACKENDS = ["ensemble", "bayesian"]
+DEFAULT_POSITIVE_CLASS_INDEX = 0
+DEFAULT_CALIBRATION_BINS = 10
+DEFAULT_BAYESIAN_MC_PASSES = 20
+DEFAULT_DECISION_THRESHOLD = 0.5
+
+THRESHOLD_START = 0.5
+THRESHOLD_STOP = 0.95
+THRESHOLD_STEP = 0.05
+
+NOISE_FACTOR_MIN = 0.0
+NOISE_FACTOR_MAX = 7.0
+NOISE_FACTOR_COUNT = 21
+
+UNCERTAINTY_CUTOFF_COUNT = 21
+
+
+def threshold_grid() -> np.ndarray:
+    if THRESHOLD_STEP <= 0:
+        raise ValueError("THRESHOLD_STEP must be > 0")
+    if THRESHOLD_STOP < THRESHOLD_START:
+        raise ValueError("THRESHOLD_STOP must be >= THRESHOLD_START")
+
+    n = int(round((THRESHOLD_STOP - THRESHOLD_START) / THRESHOLD_STEP))
+    return np.linspace(THRESHOLD_START, THRESHOLD_STOP, num=n + 1, dtype=float)
+
+
+def uncertainty_cutoff_percentiles() -> np.ndarray:
+    return np.linspace(0.0, 100.0, num=UNCERTAINTY_CUTOFF_COUNT, dtype=float)
+
+
+def noise_factor_grid() -> list[float]:
+    factors = np.linspace(
+        NOISE_FACTOR_MIN,
+        NOISE_FACTOR_MAX,
+        num=NOISE_FACTOR_COUNT,
+        dtype=float,
+    )
+    return [float(v) for v in factors]

+ 402 - 0
analysis/evaluate_models.py

@@ -0,0 +1,402 @@
+# pyright: basic
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+from tqdm.auto import tqdm
+
+from analysis.analysis_modules import (
+    run_calibration,
+    run_longitudinal,
+    run_performance,
+    run_physician,
+)
+from analysis.dataset_summary import run_dataset_summary
+from analysis.data_access import load_backend_evaluation, load_clinical_table
+from analysis.defaults import (
+    DEFAULT_BACKENDS,
+    DEFAULT_BAYESIAN_MC_PASSES,
+    DEFAULT_CALIBRATION_BINS,
+    DEFAULT_DECISION_THRESHOLD,
+    DEFAULT_POSITIVE_CLASS_INDEX,
+    noise_factor_grid,
+    threshold_grid,
+)
+from analysis.holdout_evaluation import ensure_backend_netcdf
+from analysis.longitudinal_audit import run_longitudinal_breakdown_audit
+from analysis.noise_correlation import run_noise_accuracy_uncertainty_analysis
+from analysis.noise_analysis import run_noise_analysis
+from analysis.runtime import backend_dir, init_runtime_paths, load_config, write_json
+
+
+def _plot_description(filename: str) -> str:
+    descriptions = {
+        "performance_threshold_accuracy.png": "Accuracy as the decision threshold varies.",
+        "performance_threshold_f1.png": "F1 score as the decision threshold varies.",
+        "performance_threshold_accuracy_f1.png": "Accuracy and F1 shown side-by-side as the decision threshold varies.",
+        "performance_uncertainty_cutoff_accuracy.png": "Accuracy while progressively restricting to higher-confidence and uncertainty-metric subsets.",
+        "performance_uncertainty_cutoff_f1.png": "F1 score while progressively restricting to higher-confidence and uncertainty-metric subsets.",
+        "performance_uncertainty_cutoff_accuracy_f1.png": "Accuracy and F1 shown side-by-side across uncertainty-cutoff restriction levels.",
+        "performance_uncertainty_percentile_cutoff_accuracy.png": "Accuracy from least to most restricted percentile-wise subset selection.",
+        "performance_uncertainty_percentile_cutoff_f1.png": "F1 score from least to most restricted percentile-wise subset selection.",
+        "performance_uncertainty_percentile_cutoff_accuracy_f1.png": "Accuracy and F1 shown side-by-side across percentile-floor restriction levels.",
+        "calibration_reliability.png": "Reliability diagram comparing predicted probability to empirical outcome frequency.",
+        "physician_confidence_boxplot.png": "Confidence grouped by physician confidence ratings.",
+        "physician_std_boxplot.png": "Standard deviation grouped by physician confidence ratings.",
+        "physician_predictive_entropy_boxplot.png": "Predictive uncertainty grouped by physician confidence ratings.",
+        "longitudinal_cohort_confidence.png": "Longitudinal cohort comparison using confidence.",
+        "longitudinal_cohort_std.png": "Longitudinal cohort comparison using standard deviation.",
+        "longitudinal_cohort_predictive_entropy.png": "Longitudinal cohort comparison using predictive uncertainty.",
+        "noise_sensitivity_accuracy.png": "Accuracy trend across increasing Gaussian noise factors.",
+        "noise_sensitivity_f1.png": "F1 trend across increasing Gaussian noise factors.",
+        "noise_sensitivity_accuracy_f1.png": "Accuracy and F1 shown side-by-side across increasing Gaussian noise factors.",
+        "noise_confidence.png": "Confidence trend across increasing Gaussian noise factors.",
+        "noise_standard_deviation.png": "Standard deviation trend across increasing Gaussian noise factors.",
+        "noise_confidence_standard_deviation.png": "Confidence and standard deviation shown side-by-side across increasing Gaussian noise factors.",
+        "noise_predictive_uncertainty.png": "Predictive uncertainty trend across increasing Gaussian noise factors.",
+        "noise_confidence_predictive_uncertainty.png": "Confidence and predictive uncertainty shown side-by-side across increasing Gaussian noise factors.",
+        "noise_accuracy_uncertainty_2d.png": "2D uncertainty-vs-accuracy relationship with linear fit (noise factor encoded by color).",
+        "ensemble_noise_examples.png": "Representative noisy image slices across selected Gaussian noise factors.",
+        "bayesian_noise_examples.png": "Representative noisy image slices across selected Gaussian noise factors.",
+        "ensemble_clean_scan_example.png": "Example clean scan image with no added noise.",
+        "bayesian_clean_scan_example.png": "Example clean scan image with no added noise.",
+    }
+    return descriptions.get(filename, "Generated analysis plot.")
+
+
+def _write_backend_plot_report(backend: str, out_dir: Path) -> Path:
+    plots_dir = out_dir / "plots"
+    images = sorted(plots_dir.rglob("*.png")) if plots_dir.exists() else []
+
+    report_path = out_dir / "plots_report.md"
+    lines = [
+        f"# {backend.title()} Analysis Plot Report",
+        "",
+        "This document lists generated analysis plots with brief descriptions.",
+        "",
+    ]
+    if not images:
+        lines.append("No plot images were generated for this backend run.")
+    else:
+        for image_path in images:
+            rel = image_path.relative_to(out_dir).as_posix()
+            title = image_path.stem.replace("_", " ").title()
+            lines.append(f"## {title}")
+            lines.append(_plot_description(image_path.name))
+            lines.append("")
+            lines.append(f"![{title}]({rel})")
+            lines.append("")
+
+    report_path.write_text("\n".join(lines), encoding="utf-8")
+    return report_path
+
+
+def _parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(
+        description=(
+            "Run modular evaluation analyses for ensemble and bayesian models. "
+            "All outputs are written to alnn_rewrite/analysis_output."
+        )
+    )
+    parser.add_argument(
+        "--backend",
+        nargs="+",
+        choices=["ensemble", "bayesian"],
+        default=DEFAULT_BACKENDS,
+        help="Backends to evaluate.",
+    )
+    parser.add_argument(
+        "--run-name",
+        default=None,
+        help="Optional run directory name under analysis_output.",
+    )
+    parser.add_argument(
+        "--skip-noise",
+        action="store_true",
+        help="Skip Gaussian noise sensitivity analysis.",
+    )
+    parser.add_argument(
+        "--longitudinal-breakdown-only",
+        action="store_true",
+        help=(
+            "Run only longitudinal cohort breakdown audit from existing model "
+            "evaluation outputs (no full analysis rerun)."
+        ),
+    )
+    parser.add_argument(
+        "--noise-correlation-only",
+        action="store_true",
+        help=(
+            "Run only the noise uncertainty-vs-accuracy correlation/regression "
+            "analysis from an existing noise_sensitivity.csv per backend."
+        ),
+    )
+    parser.add_argument(
+        "--dataset-summary-only",
+        action="store_true",
+        help=(
+            "Generate only dataset composition summary documentation "
+            "(overall and train/validation/test class breakdown)."
+        ),
+    )
+
+    args = parser.parse_args()
+    only_modes = [
+        bool(args.longitudinal_breakdown_only),
+        bool(args.noise_correlation_only),
+        bool(args.dataset_summary_only),
+    ]
+    if sum(only_modes) > 1:
+        parser.error(
+            "Only one of --longitudinal-breakdown-only, "
+            "--noise-correlation-only, and --dataset-summary-only may be used at once."
+        )
+
+    return args
+
+
+def _run_longitudinal_breakdown_only(
+    config: dict[str, Any],
+    backend: str,
+    clinical_df: pd.DataFrame,
+    out_dir: Path,
+) -> dict[str, Any]:
+    evaluation = load_backend_evaluation(
+        config=config,
+        backend=backend,
+        class_index=DEFAULT_POSITIVE_CLASS_INDEX,
+    )
+    summary = run_longitudinal_breakdown_audit(
+        evaluation=evaluation,
+        clinical_df=clinical_df,
+        output_dir=out_dir,
+    )
+    write_json(out_dir / "longitudinal_breakdown_backend_summary.json", summary)
+    return summary
+
+
+def _run_noise_correlation_only(
+    backend: str,
+    out_dir: Path,
+) -> dict[str, Any]:
+    noise_table_path = out_dir / "noise_sensitivity.csv"
+    if not noise_table_path.exists():
+        raise FileNotFoundError(
+            f"Expected existing noise table for --noise-correlation-only: {noise_table_path}"
+        )
+
+    noise_df = pd.read_csv(noise_table_path)
+    summary = run_noise_accuracy_uncertainty_analysis(
+        noise_df=noise_df,
+        backend=backend,
+        output_dir=out_dir,
+    )
+    write_json(out_dir / "noise_accuracy_uncertainty_backend_summary.json", summary)
+    return summary
+
+
+def _run_backend(
+    config: dict[str, Any],
+    root_dir: Path,
+    backend: str,
+    clinical_df: pd.DataFrame,
+    skip_noise: bool,
+    out_dir: Path,
+) -> dict[str, Any]:
+    netcdf_path = ensure_backend_netcdf(
+        config=config,
+        root_dir=root_dir,
+        backend=backend,
+        bayesian_mc_passes=DEFAULT_BAYESIAN_MC_PASSES,
+    )
+
+    evaluation = load_backend_evaluation(
+        config=config,
+        backend=backend,
+        class_index=DEFAULT_POSITIVE_CLASS_INDEX,
+    )
+
+    thresholds = threshold_grid()
+    noise_factors = noise_factor_grid()
+
+    summary: dict[str, Any] = {
+        "backend": backend,
+        "netcdf": str(netcdf_path),
+        "source_file": str(evaluation.source_file),
+        "uncertainty_metric": evaluation.uncertainty_metric,
+    }
+
+    n_stages = 4 + (0 if skip_noise else 2)
+    stage_bar = tqdm(
+        total=n_stages,
+        desc=f"[{backend}] analysis stages",
+        unit="stage",
+        leave=False,
+    )
+    try:
+        stage_bar.set_postfix_str("performance")
+        summary["performance"] = run_performance(
+            evaluation=evaluation,
+            output_dir=out_dir,
+            thresholds=thresholds,
+        )
+        stage_bar.update(1)
+
+        stage_bar.set_postfix_str("calibration")
+        summary["calibration"] = run_calibration(
+            evaluation=evaluation,
+            output_dir=out_dir,
+            bins=DEFAULT_CALIBRATION_BINS,
+        )
+        stage_bar.update(1)
+
+        stage_bar.set_postfix_str("physician")
+        summary["physician"] = run_physician(
+            evaluation=evaluation,
+            clinical_df=clinical_df,
+            output_dir=out_dir,
+        )
+        stage_bar.update(1)
+
+        stage_bar.set_postfix_str("longitudinal")
+        summary["longitudinal"] = run_longitudinal(
+            evaluation=evaluation,
+            clinical_df=clinical_df,
+            output_dir=out_dir,
+        )
+        stage_bar.update(1)
+
+        if skip_noise:
+            summary["noise"] = {"skipped": True, "reason": "--skip-noise supplied"}
+            summary["noise_accuracy_uncertainty"] = {
+                "skipped": True,
+                "reason": "Noise analysis skipped, so no noise table available.",
+            }
+        else:
+            try:
+                stage_bar.set_postfix_str("noise")
+                summary["noise"] = run_noise_analysis(
+                    config=config,
+                    root_dir=root_dir,
+                    backend=backend,
+                    output_dir=out_dir,
+                    class_index=DEFAULT_POSITIVE_CLASS_INDEX,
+                    noise_sigmas=noise_factors,
+                    threshold=DEFAULT_DECISION_THRESHOLD,
+                    calibration_bins=DEFAULT_CALIBRATION_BINS,
+                    bayesian_mc_passes=DEFAULT_BAYESIAN_MC_PASSES,
+                )
+                stage_bar.update(1)
+
+                stage_bar.set_postfix_str("noise-correlation")
+                noise_table_path = Path(str(summary["noise"]["table"]))
+                noise_df = pd.read_csv(noise_table_path)
+                summary["noise_accuracy_uncertainty"] = (
+                    run_noise_accuracy_uncertainty_analysis(
+                        noise_df=noise_df,
+                        backend=backend,
+                        output_dir=out_dir,
+                    )
+                )
+                stage_bar.update(1)
+            except Exception as exc:
+                summary["noise"] = {
+                    "skipped": True,
+                    "reason": f"Noise analysis failed: {exc}",
+                }
+                summary["noise_accuracy_uncertainty"] = {
+                    "skipped": True,
+                    "reason": f"Noise relationship analysis failed: {exc}",
+                }
+                stage_bar.update(2)
+    finally:
+        stage_bar.close()
+
+    report_path = _write_backend_plot_report(backend=backend, out_dir=out_dir)
+    summary["plots_report"] = str(report_path)
+    write_json(out_dir / "backend_summary.json", summary)
+    return summary
+
+
+def main() -> None:
+    args = _parse_args()
+
+    analysis_dir = Path(__file__).resolve().parent
+    paths = init_runtime_paths(analysis_dir=analysis_dir, run_name=args.run_name)
+    config = load_config(paths.root_dir)
+    clinical_df = load_clinical_table(config=config, root_dir=paths.root_dir)
+
+    manifest: dict[str, Any] = {
+        "run_dir": str(paths.run_dir),
+        "output_root": str(paths.output_root),
+        "mode": (
+            "dataset_summary_only"
+            if bool(args.dataset_summary_only)
+            else (
+                "longitudinal_breakdown_only"
+                if bool(args.longitudinal_breakdown_only)
+                else (
+                    "noise_correlation_only"
+                    if bool(args.noise_correlation_only)
+                    else "full"
+                )
+            )
+        ),
+        "positive_class_index": DEFAULT_POSITIVE_CLASS_INDEX,
+        "threshold_sweep": {
+            "values": [float(v) for v in threshold_grid().tolist()],
+        },
+        "calibration_bins": DEFAULT_CALIBRATION_BINS,
+        "noise_factors": noise_factor_grid(),
+        "bayesian_mc_passes": DEFAULT_BAYESIAN_MC_PASSES,
+        "decision_threshold": DEFAULT_DECISION_THRESHOLD,
+        "backends": {},
+    }
+
+    if args.dataset_summary_only:
+        manifest["dataset_summary"] = run_dataset_summary(
+            config=config,
+            root_dir=paths.root_dir,
+            output_dir=paths.run_dir,
+            positive_class_index=DEFAULT_POSITIVE_CLASS_INDEX,
+        )
+        write_json(paths.run_dir / "run_manifest.json", manifest)
+        print(f"Dataset summary complete. Results saved to {paths.run_dir}")
+        return
+
+    backend_iter = tqdm(args.backend, desc="Backends", unit="backend")
+    for backend in backend_iter:
+        out_dir = backend_dir(paths, backend)
+        backend_iter.set_postfix_str(backend)
+        if args.longitudinal_breakdown_only:
+            manifest["backends"][backend] = _run_longitudinal_breakdown_only(
+                config=config,
+                backend=backend,
+                clinical_df=clinical_df,
+                out_dir=out_dir,
+            )
+        elif args.noise_correlation_only:
+            manifest["backends"][backend] = _run_noise_correlation_only(
+                backend=backend,
+                out_dir=out_dir,
+            )
+        else:
+            manifest["backends"][backend] = _run_backend(
+                config=config,
+                root_dir=paths.root_dir,
+                backend=backend,
+                clinical_df=clinical_df,
+                skip_noise=bool(args.skip_noise),
+                out_dir=out_dir,
+            )
+
+    write_json(paths.run_dir / "run_manifest.json", manifest)
+    print(f"Analysis complete. Results saved to {paths.run_dir}")
+
+
+if __name__ == "__main__":
+    main()

+ 239 - 0
analysis/holdout_evaluation.py

@@ -0,0 +1,239 @@
+# pyright: basic
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import torch
+from torch.utils.data import DataLoader
+from tqdm.auto import tqdm
+import xarray as xr
+
+from model.cnn import CNN3D
+
+from .data_pipeline import build_holdout_loader
+from .model_utils import configure_bayesian_sampling_mode
+
+
+def _training_seed(config: dict[str, Any], backend_dir: Path) -> int:
+    config_json = backend_dir / "config.json"
+    if config_json.exists():
+        try:
+            with config_json.open("r", encoding="utf-8") as f:
+                trained = json.load(f)
+            return int(trained["data"]["seed"])
+        except Exception:
+            pass
+    return int(config["data"]["seed"])
+
+
+def _init_cnn(config: dict[str, Any]) -> torch.nn.Module:
+    return (
+        CNN3D(
+            image_channels=int(config["data"]["image_channels"]),
+            clin_data_channels=int(config["data"]["clin_data_channels"]),
+            num_classes=int(config["data"]["num_classes"]),
+            droprate=float(config["training"]["droprate"]),
+        )
+        .float()
+        .to(config["training"]["device"])
+    )
+
+
+def _extract_img_id(img_id_tensor: Any) -> int:
+    if hasattr(img_id_tensor, "detach"):
+        return int(img_id_tensor.detach().cpu().item())
+    return int(img_id_tensor)
+
+
+def _evaluate_ensemble(
+    config: dict[str, Any],
+    backend_dir: Path,
+    holdout_loader: DataLoader,
+) -> xr.Dataset:
+    device = str(config["training"]["device"])
+    model_files = sorted(backend_dir.glob("model_run_*.pt"))
+    if not model_files:
+        raise FileNotFoundError(f"No ensemble checkpoints found in {backend_dir}")
+
+    n_models = len(model_files)
+    n_samples = len(holdout_loader)
+    n_classes = int(config["data"]["num_classes"])
+
+    predictions = np.zeros((n_models, n_samples, n_classes), dtype=np.float32)
+    labels = np.zeros((n_samples, n_classes), dtype=np.float32)
+    image_ids = np.zeros((n_samples,), dtype=int)
+
+    model_iter = tqdm(model_files, desc="Ensemble checkpoints", unit="model")
+    for model_i, model_file in enumerate(model_iter):
+        model_iter.set_postfix_str(model_file.name)
+        model = _init_cnn(config)
+        model.load_state_dict(
+            torch.load(model_file, map_location=device),
+            strict=False,
+        )
+        model.to(device)
+        model.eval()
+
+        with torch.no_grad():
+            sample_iter = tqdm(
+                holdout_loader,
+                total=n_samples,
+                desc=f"{model_file.name}",
+                unit="batch",
+                leave=False,
+            )
+            for sample_i, (mri, xls, label, img_id) in enumerate(sample_iter):
+                mri_device = mri.float().to(device)
+                xls_device = xls.float().to(device)
+                output = model((mri_device, xls_device))
+                predictions[model_i, sample_i, :] = output.detach().cpu().numpy()[0, :]
+
+                if model_i == 0:
+                    labels[sample_i, :] = label.detach().cpu().numpy()[0, :]
+                    image_ids[sample_i] = _extract_img_id(img_id)
+
+    model_coord = [int(f.stem.split("_")[2]) for f in model_files]
+    return xr.Dataset(
+        {
+            "predictions": xr.DataArray(
+                predictions,
+                dims=["model", "img_id", "img_class"],
+                coords={
+                    "model": model_coord,
+                    "img_id": image_ids,
+                    "img_class": list(range(n_classes)),
+                },
+            ),
+            "labels": xr.DataArray(
+                labels,
+                dims=["img_id", "label"],
+                coords={
+                    "img_id": image_ids,
+                    "label": list(range(n_classes)),
+                },
+            ),
+        }
+    )
+
+
+def _evaluate_bayesian(
+    config: dict[str, Any],
+    backend_dir: Path,
+    holdout_loader: DataLoader,
+    mc_passes: int,
+) -> xr.Dataset:
+    device = str(config["training"]["device"])
+    try:
+        from bayesian_torch.models.dnn_to_bnn import dnn_to_bnn  # type: ignore[import-untyped]
+    except ImportError as e:
+        raise ImportError(
+            "bayesian_torch is required to evaluate bayesian checkpoints"
+        ) from e
+
+    ckpt = backend_dir / "model_bayesian.pt"
+    if not ckpt.exists():
+        raise FileNotFoundError(f"Bayesian checkpoint not found: {ckpt}")
+
+    model = _init_cnn(config)
+    prior_params: dict[str, float | bool | str] = {
+        "prior_mu": 0.0,
+        "prior_sigma": 1.0,
+        "posterior_mu_init": 0.0,
+        "posterior_rho_init": -3.0,
+        "type": "Reparameterization",
+        "moped_enable": False,
+        "moped_delta": 0.5,
+    }
+    dnn_to_bnn(model, prior_params)
+    model.to(device)
+    model.load_state_dict(torch.load(ckpt, map_location=device), strict=False)
+    model.to(device)
+    configure_bayesian_sampling_mode(
+        model,
+        stochastic=False,
+        freeze_batchnorm=True,
+    )
+
+    n_samples = len(holdout_loader)
+    n_classes = int(config["data"]["num_classes"])
+
+    predictions = np.zeros((mc_passes, n_samples, n_classes), dtype=np.float32)
+    labels = np.zeros((n_samples, n_classes), dtype=np.float32)
+    image_ids = np.zeros((n_samples,), dtype=int)
+
+    with torch.no_grad():
+        pass_iter = tqdm(range(mc_passes), desc="Bayesian MC passes", unit="pass")
+        for pass_i in pass_iter:
+            pass_iter.set_postfix_str(f"pass={pass_i + 1}/{mc_passes}")
+            sample_iter = tqdm(
+                holdout_loader,
+                total=n_samples,
+                desc=f"MC pass {pass_i + 1}",
+                unit="batch",
+                leave=False,
+            )
+            for sample_i, (mri, xls, label, img_id) in enumerate(sample_iter):
+                mri_device = mri.float().to(device)
+                xls_device = xls.float().to(device)
+                output = model((mri_device, xls_device))
+                predictions[pass_i, sample_i, :] = output.detach().cpu().numpy()[0, :]
+
+                if pass_i == 0:
+                    labels[sample_i, :] = label.detach().cpu().numpy()[0, :]
+                    image_ids[sample_i] = _extract_img_id(img_id)
+
+    return xr.Dataset(
+        {
+            "predictions": xr.DataArray(
+                predictions,
+                dims=["sample", "img_id", "img_class"],
+                coords={
+                    "sample": list(range(mc_passes)),
+                    "img_id": image_ids,
+                    "img_class": list(range(n_classes)),
+                },
+            ),
+            "labels": xr.DataArray(
+                labels,
+                dims=["img_id", "label"],
+                coords={
+                    "img_id": image_ids,
+                    "label": list(range(n_classes)),
+                },
+            ),
+        }
+    )
+
+
+def ensure_backend_netcdf(
+    config: dict[str, Any],
+    root_dir: Path,
+    backend: str,
+    bayesian_mc_passes: int,
+) -> Path:
+    backend_dir = Path(config["output"][f"{backend}_path"]).expanduser().resolve()
+    backend_dir.mkdir(parents=True, exist_ok=True)
+
+    output_path = backend_dir / "model_evaluation_results.nc"
+    if output_path.exists():
+        return output_path
+
+    holdout_loader = build_holdout_loader(
+        config=config,
+        root_dir=root_dir,
+        seed=_training_seed(config, backend_dir),
+    )
+
+    if backend == "ensemble":
+        ds = _evaluate_ensemble(config, backend_dir, holdout_loader)
+    elif backend == "bayesian":
+        ds = _evaluate_bayesian(config, backend_dir, holdout_loader, bayesian_mc_passes)
+    else:
+        raise ValueError(f"Unsupported backend: {backend}")
+
+    ds.to_netcdf(output_path, mode="w")
+    return output_path

+ 295 - 0
analysis/longitudinal_audit.py

@@ -0,0 +1,295 @@
+# pyright: basic
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pandas as pd
+
+from .data_access import BackendEvaluation
+from .runtime import write_json
+
+
+def _normalize_dx(value: Any) -> str:
+    if value is None or (isinstance(value, float) and np.isnan(value)):
+        return ""
+
+    v = str(value).strip().upper()
+    if v in {"NL", "NORMAL"}:
+        return "CN"
+    return v
+
+
+def _is_mci(value: str) -> bool:
+    return "MCI" in str(value).strip().upper()
+
+
+def _assign_cohort(diagnoses: list[str]) -> str:
+    if len(diagnoses) < 2:
+        return "insufficient_visits"
+
+    first_dx = diagnoses[0]
+    last_dx = diagnoses[-1]
+    unique_dx = set(diagnoses)
+
+    if unique_dx.issubset({"CN"}):
+        return "stable_cn"
+    if unique_dx.issubset({"AD"}):
+        return "stable_ad"
+    if all(_is_mci(dx) for dx in unique_dx):
+        return "stable_mci"
+    if first_dx == "CN" and "AD" in unique_dx and last_dx == "AD":
+        return "cn_to_ad"
+    if first_dx == "CN" and _is_mci(last_dx):
+        return "cn_to_mci"
+    if _is_mci(first_dx) and last_dx == "AD":
+        return "mci_to_ad"
+
+    return "other"
+
+
+def _prepare_clinical_longitudinal_table(clinical_df: pd.DataFrame) -> pd.DataFrame:
+    required = ["Image Data ID", "PTID"]
+    missing = [c for c in required if c not in clinical_df.columns]
+    if missing:
+        raise KeyError(f"Missing columns for longitudinal analysis: {missing}")
+
+    diagnosis_col = None
+    for candidate in ["Class", "DX", "Diagnosis"]:
+        if candidate in clinical_df.columns:
+            diagnosis_col = candidate
+            break
+    if diagnosis_col is None:
+        raise KeyError(
+            "No diagnosis column found. Expected one of: Class, DX, Diagnosis"
+        )
+
+    work = clinical_df[
+        ["Image Data ID", "PTID", diagnosis_col]
+        + [c for c in ["EXAMDATE"] if c in clinical_df.columns]
+    ].copy()
+
+    work["Image Data ID"] = pd.to_numeric(work["Image Data ID"], errors="coerce")
+    work = work.dropna(subset=["Image Data ID", "PTID"])
+    work["Image Data ID"] = work["Image Data ID"].astype(int)
+    work["PTID"] = work["PTID"].astype(str).str.strip()
+    work["diagnosis"] = work[diagnosis_col].map(_normalize_dx)
+
+    if "EXAMDATE" in work.columns:
+        work["EXAMDATE"] = pd.to_datetime(
+            work["EXAMDATE"],
+            errors="coerce",
+            format="mixed",
+        )
+        work = work.sort_values(["PTID", "EXAMDATE"], na_position="last")
+    else:
+        work = work.sort_values(["PTID", "Image Data ID"])
+
+    return work
+
+
+def _build_patient_table(work: pd.DataFrame) -> pd.DataFrame:
+    rows: list[dict[str, Any]] = []
+    for ptid, group in work.groupby("PTID"):
+        diagnoses = [d for d in group["diagnosis"].tolist() if d]
+        cohort = _assign_cohort(diagnoses)
+        first_dx = diagnoses[0] if diagnoses else ""
+        last_dx = diagnoses[-1] if diagnoses else ""
+        rows.append(
+            {
+                "PTID": ptid,
+                "n_visits": int(len(group)),
+                "n_labeled_visits": int(len(diagnoses)),
+                "first_dx": first_dx,
+                "last_dx": last_dx,
+                "cohort": cohort,
+            }
+        )
+
+    if not rows:
+        return pd.DataFrame(
+            columns=[
+                "PTID",
+                "n_visits",
+                "n_labeled_visits",
+                "first_dx",
+                "last_dx",
+                "cohort",
+            ]
+        )
+
+    return pd.DataFrame(rows)
+
+
+def _count_table(df: pd.DataFrame, column: str, count_name: str) -> pd.DataFrame:
+    if column not in df.columns or df.empty:
+        return pd.DataFrame({column: [], count_name: []})
+    return (
+        df.groupby(column, dropna=False)
+        .size()
+        .rename(count_name)
+        .reset_index()
+        .sort_values(count_name, ascending=False)
+    )
+
+
+def run_longitudinal_breakdown_audit(
+    evaluation: BackendEvaluation,
+    clinical_df: pd.DataFrame,
+    output_dir: Path,
+) -> dict[str, Any]:
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    full_table = _prepare_clinical_longitudinal_table(clinical_df)
+    evaluable_table = full_table[
+        full_table["Image Data ID"].isin(evaluation.image_ids.astype(int))
+    ].copy()
+
+    full_patient = _build_patient_table(full_table)
+    evaluable_patient = _build_patient_table(evaluable_table)
+
+    full_patient_min2 = full_patient[full_patient["n_labeled_visits"] >= 2].copy()
+    evaluable_patient_min2 = evaluable_patient[
+        evaluable_patient["n_labeled_visits"] >= 2
+    ].copy()
+
+    full_diagnosis_counts = _count_table(full_table, "diagnosis", "n_rows")
+    evaluable_diagnosis_counts = _count_table(evaluable_table, "diagnosis", "n_rows")
+
+    full_transition_counts = (
+        full_patient_min2.groupby(["first_dx", "last_dx"])
+        .size()
+        .rename("n_patients")
+        .reset_index()
+    )
+    full_transition_counts = full_transition_counts.sort_values(
+        "n_patients", ascending=False
+    )
+
+    evaluable_transition_counts = (
+        evaluable_patient_min2.groupby(["first_dx", "last_dx"])
+        .size()
+        .rename("n_patients")
+        .reset_index()
+    )
+    evaluable_transition_counts = evaluable_transition_counts.sort_values(
+        "n_patients", ascending=False
+    )
+
+    full_cohort_counts = _count_table(full_patient_min2, "cohort", "n_patients")
+    evaluable_cohort_counts = _count_table(
+        evaluable_patient_min2, "cohort", "n_patients"
+    )
+
+    paths = {
+        "full_diagnosis_counts": output_dir / "longitudinal_diagnosis_counts_all.csv",
+        "evaluable_diagnosis_counts": output_dir
+        / "longitudinal_diagnosis_counts_evaluable.csv",
+        "full_transition_counts": output_dir / "longitudinal_transition_counts_all.csv",
+        "evaluable_transition_counts": output_dir
+        / "longitudinal_transition_counts_evaluable.csv",
+        "full_cohort_counts": output_dir / "longitudinal_cohort_counts_all.csv",
+        "evaluable_cohort_counts": output_dir
+        / "longitudinal_cohort_counts_evaluable.csv",
+        "full_patient_table": output_dir / "longitudinal_patient_breakdown_all.csv",
+        "evaluable_patient_table": output_dir
+        / "longitudinal_patient_breakdown_evaluable.csv",
+        "summary_md": output_dir / "longitudinal_breakdown_summary.md",
+        "summary_json": output_dir / "longitudinal_breakdown_summary.json",
+    }
+
+    full_diagnosis_counts.to_csv(paths["full_diagnosis_counts"], index=False)
+    evaluable_diagnosis_counts.to_csv(paths["evaluable_diagnosis_counts"], index=False)
+    full_transition_counts.to_csv(paths["full_transition_counts"], index=False)
+    evaluable_transition_counts.to_csv(
+        paths["evaluable_transition_counts"], index=False
+    )
+    full_cohort_counts.to_csv(paths["full_cohort_counts"], index=False)
+    evaluable_cohort_counts.to_csv(paths["evaluable_cohort_counts"], index=False)
+    full_patient.to_csv(paths["full_patient_table"], index=False)
+    evaluable_patient.to_csv(paths["evaluable_patient_table"], index=False)
+
+    summary = {
+        "backend": evaluation.backend,
+        "n_evaluation_images": int(len(evaluation.image_ids)),
+        "n_clinical_rows_total": int(len(full_table)),
+        "n_clinical_rows_evaluable": int(len(evaluable_table)),
+        "n_patients_total": (
+            int(full_patient["PTID"].nunique()) if not full_patient.empty else 0
+        ),
+        "n_patients_evaluable": (
+            int(evaluable_patient["PTID"].nunique())
+            if not evaluable_patient.empty
+            else 0
+        ),
+        "n_patients_total_min2_visits": int(len(full_patient_min2)),
+        "n_patients_evaluable_min2_visits": int(len(evaluable_patient_min2)),
+        "n_patients_evaluable_single_visit_only": (
+            int((evaluable_patient["n_labeled_visits"] < 2).sum())
+            if not evaluable_patient.empty
+            else 0
+        ),
+        "tables": {
+            "full_diagnosis_counts": str(paths["full_diagnosis_counts"]),
+            "evaluable_diagnosis_counts": str(paths["evaluable_diagnosis_counts"]),
+            "full_transition_counts": str(paths["full_transition_counts"]),
+            "evaluable_transition_counts": str(paths["evaluable_transition_counts"]),
+            "full_cohort_counts": str(paths["full_cohort_counts"]),
+            "evaluable_cohort_counts": str(paths["evaluable_cohort_counts"]),
+            "full_patient_table": str(paths["full_patient_table"]),
+            "evaluable_patient_table": str(paths["evaluable_patient_table"]),
+        },
+    }
+    write_json(paths["summary_json"], summary)
+
+    cohort_text = "(none)"
+    if not evaluable_cohort_counts.empty:
+        cohort_text = "\n".join(
+            [
+                f"- {str(r['cohort'])}: {int(r['n_patients'])}"
+                for _, r in evaluable_cohort_counts.iterrows()
+            ]
+        )
+
+    lines = [
+        f"# Longitudinal Breakdown Audit ({evaluation.backend})",
+        "",
+        "This report explains the patient/cohort counts used by the longitudinal analysis.",
+        "",
+        "## Key Counts",
+        f"- Evaluation images available: {summary['n_evaluation_images']}",
+        f"- Clinical rows total: {summary['n_clinical_rows_total']}",
+        f"- Clinical rows overlapping evaluated images: {summary['n_clinical_rows_evaluable']}",
+        f"- Patients total (any visits): {summary['n_patients_total']}",
+        f"- Patients with overlapping evaluated images: {summary['n_patients_evaluable']}",
+        f"- Patients with >=2 labeled visits (full clinical): {summary['n_patients_total_min2_visits']}",
+        f"- Patients with >=2 labeled visits (evaluable overlap): {summary['n_patients_evaluable_min2_visits']}",
+        f"- Overlap patients with only 1 labeled visit: {summary['n_patients_evaluable_single_visit_only']}",
+        "",
+        "## Evaluable Cohort Counts (>=2 labeled visits)",
+        cohort_text,
+        "",
+        "## Interpretation",
+        "- The longitudinal plot uses only patients with at least two labeled visits after intersection with the evaluated image IDs.",
+        "- If MCI transition cohorts are missing, it typically means those patients do not have enough overlapping evaluated visits (or labels) in this run.",
+        "- Use the CSV tables below to inspect diagnosis and transition distributions in full clinical data versus evaluable overlap.",
+        "",
+        "## Output Tables",
+        f"- {paths['full_diagnosis_counts'].name}",
+        f"- {paths['evaluable_diagnosis_counts'].name}",
+        f"- {paths['full_transition_counts'].name}",
+        f"- {paths['evaluable_transition_counts'].name}",
+        f"- {paths['full_cohort_counts'].name}",
+        f"- {paths['evaluable_cohort_counts'].name}",
+        f"- {paths['full_patient_table'].name}",
+        f"- {paths['evaluable_patient_table'].name}",
+    ]
+    paths["summary_md"].write_text("\n".join(lines), encoding="utf-8")
+
+    return {
+        **summary,
+        "summary_markdown": str(paths["summary_md"]),
+        "summary_json": str(paths["summary_json"]),
+    }

+ 108 - 0
analysis/metrics.py

@@ -0,0 +1,108 @@
+# pyright: basic
+
+from __future__ import annotations
+
+import numpy as np
+
+
+def binary_confusion(
+    y_true: np.ndarray, y_prob: np.ndarray, threshold: float
+) -> dict[str, int]:
+    y_true_int = y_true.astype(int)
+    y_pred = (y_prob >= threshold).astype(int)
+
+    tp = int(np.logical_and(y_pred == 1, y_true_int == 1).sum())
+    fp = int(np.logical_and(y_pred == 1, y_true_int == 0).sum())
+    tn = int(np.logical_and(y_pred == 0, y_true_int == 0).sum())
+    fn = int(np.logical_and(y_pred == 0, y_true_int == 1).sum())
+
+    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}
+
+
+def _safe_div(num: float, den: float) -> float:
+    if den == 0:
+        return 0.0
+    return num / den
+
+
+def performance_at_threshold(
+    y_true: np.ndarray,
+    y_prob: np.ndarray,
+    threshold: float,
+) -> dict[str, float]:
+    c = binary_confusion(y_true, y_prob, threshold)
+    tp = c["tp"]
+    fp = c["fp"]
+    tn = c["tn"]
+    fn = c["fn"]
+
+    total = tp + fp + tn + fn
+    accuracy = _safe_div(tp + tn, total)
+    precision = _safe_div(tp, tp + fp)
+    recall = _safe_div(tp, tp + fn)
+    f1 = _safe_div(2 * precision * recall, precision + recall)
+
+    return {
+        "threshold": float(threshold),
+        "accuracy": float(accuracy),
+        "precision": float(precision),
+        "recall": float(recall),
+        "f1": float(f1),
+        "tp": float(tp),
+        "fp": float(fp),
+        "tn": float(tn),
+        "fn": float(fn),
+    }
+
+
+def threshold_sweep(
+    y_true: np.ndarray,
+    y_prob: np.ndarray,
+    thresholds: np.ndarray,
+) -> list[dict[str, float]]:
+    return [performance_at_threshold(y_true, y_prob, float(t)) for t in thresholds]
+
+
+def calibration_stats(
+    y_true: np.ndarray,
+    y_prob: np.ndarray,
+    bins: int = 10,
+) -> tuple[dict[str, float], np.ndarray]:
+    y_true_int = y_true.astype(int)
+    y_prob_f = y_prob.astype(float)
+
+    edges = np.linspace(0.0, 1.0, bins + 1)
+    bin_data: list[tuple[float, float, int]] = []
+    mce = 0.0
+
+    n = len(y_prob_f)
+    for i in range(bins):
+        lo = edges[i]
+        hi = edges[i + 1]
+        if i == bins - 1:
+            mask = (y_prob_f >= lo) & (y_prob_f <= hi)
+        else:
+            mask = (y_prob_f >= lo) & (y_prob_f < hi)
+
+        count = int(mask.sum())
+        if count == 0:
+            bin_data.append((float((lo + hi) / 2.0), np.nan, 0))
+            continue
+
+        mean_conf = float(y_prob_f[mask].mean())
+        frac_pos = float(y_true_int[mask].mean())
+        gap = abs(frac_pos - mean_conf)
+        mce = max(mce, gap)
+
+        bin_data.append((mean_conf, frac_pos, count))
+
+    brier = float(np.mean((y_prob_f - y_true_int) ** 2))
+
+    summary = {
+        "mce": float(mce),
+        "brier": brier,
+        "bins": float(bins),
+    }
+
+    arr = np.array(bin_data, dtype=float)
+    return summary, arr

+ 46 - 0
analysis/model_utils.py

@@ -0,0 +1,46 @@
+# pyright: basic
+
+from __future__ import annotations
+
+import torch
+from data.dataset import ADNIDataset
+
+
+def configure_bayesian_sampling_mode(
+    model: torch.nn.Module,
+    *,
+    stochastic: bool,
+    freeze_batchnorm: bool = True,
+) -> None:
+    # if stochastic:
+    #     model.train()
+    #     if freeze_batchnorm:
+    #         for module in model.modules():
+    #             if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
+    #                 module.eval()
+    # else:
+    model.eval()
+
+
+def _compute_mri_intensity_range(
+    dataset: ADNIDataset,
+) -> tuple[float, float, float]:
+    global_min = float("inf")
+    global_max = float("-inf")
+    for idx in range(len(dataset)):
+        mri, _, _, _ = dataset[idx]
+        sample_min = float(mri.detach().min().cpu())
+        sample_max = float(mri.detach().max().cpu())
+        if sample_min < global_min:
+            global_min = sample_min
+        if sample_max > global_max:
+            global_max = sample_max
+
+    intensity_range = global_max - global_min
+    if not np.isfinite(intensity_range) or intensity_range <= 0:
+        intensity_range = 1.0
+    print(
+        "Computed MRI intensity range across the full dataset: "
+        f"min={global_min:.6g}, max={global_max:.6g}, range={intensity_range:.6g}"
+    )
+    return float(global_min), float(global_max), float(intensity_range)

+ 486 - 0
analysis/noise_analysis.py

@@ -0,0 +1,486 @@
+# pyright: basic
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pandas as pd
+import torch
+from bayesian_torch.utils.util import predictive_entropy
+from tqdm.auto import tqdm
+
+from model.cnn import CNN3D
+
+from .data_pipeline import build_holdout_loader
+from .metrics import calibration_stats, performance_at_threshold
+from .model_utils import configure_bayesian_sampling_mode
+from .plotting import (
+    plots_dir,
+    save_clean_scan_image,
+    save_noise_example_grid,
+    save_metric_pair_plot,
+    save_noise_metrics_plot,
+)
+from .runtime import write_json
+
+
+def _apply_scaled_noise(
+    volume: torch.Tensor, sigma: float, intensity_range: float
+) -> torch.Tensor:
+    # Scale by global MRI intensity range measured from holdout set.
+    return volume + (torch.randn_like(volume) * sigma * intensity_range)
+
+
+def _uniform_sigma_schedule(noise_sigmas: list[float]) -> list[float]:
+    if not noise_sigmas:
+        raise ValueError("noise_sigmas must contain at least one value")
+
+    ordered = np.array(sorted(float(s) for s in noise_sigmas), dtype=float)
+    if len(ordered) == 1:
+        return [float(ordered[0])]
+
+    uniform = np.linspace(
+        float(ordered[0]), float(ordered[-1]), num=len(ordered), dtype=float
+    )
+    return [float(s) for s in uniform]
+
+
+def _load_ensemble_models(config: dict[str, Any]) -> list[torch.nn.Module]:
+    model_dir = Path(config["output"]["ensemble_path"])
+    model_files = sorted(model_dir.glob("model_run_*.pt"))
+    if not model_files:
+        raise FileNotFoundError(f"No ensemble model files found in {model_dir}")
+
+    models: list[torch.nn.Module] = []
+    for model_file in model_files:
+        model = (
+            CNN3D(
+                image_channels=int(config["data"]["image_channels"]),
+                clin_data_channels=int(config["data"]["clin_data_channels"]),
+                num_classes=int(config["data"]["num_classes"]),
+                droprate=float(config["training"]["droprate"]),
+            )
+            .float()
+            .to(config["training"]["device"])
+        )
+        model.load_state_dict(
+            torch.load(model_file, map_location=config["training"]["device"]),
+            strict=False,
+        )
+        model.eval()
+        models.append(model)
+
+    return models
+
+
+def _load_bayesian_model(config: dict[str, Any]) -> torch.nn.Module:
+    device = str(config["training"]["device"])
+    try:
+        from bayesian_torch.models.dnn_to_bnn import dnn_to_bnn  # type: ignore[import-untyped]
+    except ImportError as e:
+        raise ImportError(
+            "bayesian_torch is required for bayesian noise analysis"
+        ) from e
+
+    model_path = Path(config["output"]["bayesian_path"]) / "model_bayesian.pt"
+    if not model_path.exists():
+        raise FileNotFoundError(f"Bayesian model checkpoint not found: {model_path}")
+
+    model = (
+        CNN3D(
+            image_channels=int(config["data"]["image_channels"]),
+            clin_data_channels=int(config["data"]["clin_data_channels"]),
+            num_classes=int(config["data"]["num_classes"]),
+            droprate=float(config["training"]["droprate"]),
+        )
+        .float()
+        .to(config["training"]["device"])
+    )
+
+    prior_params: dict[str, float | bool | str] = {
+        "prior_mu": 0.0,
+        "prior_sigma": 1.0,
+        "posterior_mu_init": 0.0,
+        "posterior_rho_init": -3.0,
+        "type": "Reparameterization",
+        "moped_enable": False,
+        "moped_delta": 0.5,
+    }
+    dnn_to_bnn(model, prior_params)
+    model.to(device)
+    model.load_state_dict(torch.load(model_path, map_location=device), strict=False)
+    model.to(device)
+    configure_bayesian_sampling_mode(model, stochastic=False)
+    return model
+
+
+def _infer_with_noise_ensemble(
+    test_loader: torch.utils.data.DataLoader,
+    models: list[torch.nn.Module],
+    sigma: float,
+    intensity_range: float,
+    class_index: int,
+) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
+    if not models:
+        raise ValueError("No ensemble models were provided for noise inference")
+
+    device = next(models[0].parameters()).device
+    all_probs: list[float] = []
+    all_confidence: list[float] = []
+    all_stds: list[float] = []
+    all_true: list[int] = []
+
+    with torch.no_grad():
+        batch_iter = tqdm(
+            test_loader,
+            total=len(test_loader),
+            desc=f"ensemble sigma={sigma:g}",
+            unit="batch",
+            leave=False,
+        )
+        for mri, xls, labels, _ in batch_iter:
+            mri_device = mri.float().to(device)
+            xls_device = xls.float().to(device)
+            labels_device = labels.to(device)
+            noisy = _apply_scaled_noise(mri_device, sigma, intensity_range)
+            preds = []
+            for model in models:
+                out = model((noisy, xls_device))
+                preds.append(out[:, class_index].detach().cpu().numpy())
+
+            pred_mat = np.stack(preds, axis=0)
+            mean = pred_mat.mean(axis=0)
+            confidence = mean
+            std = pred_mat.std(axis=0)
+            true = labels_device[:, class_index].detach().cpu().numpy().astype(int)
+
+            all_probs.extend(mean.tolist())
+            all_confidence.extend(confidence.tolist())
+            all_stds.extend(std.tolist())
+            all_true.extend(true.tolist())
+
+    return (
+        np.asarray(all_true),
+        np.asarray(all_probs),
+        np.asarray(all_confidence),
+        np.asarray(all_stds),
+    )
+
+
+def _infer_with_noise_bayesian(
+    test_loader: torch.utils.data.DataLoader,
+    model: torch.nn.Module,
+    sigma: float,
+    intensity_range: float,
+    class_index: int,
+    mc_passes: int,
+) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
+    device = next(model.parameters()).device
+    all_probs: list[float] = []
+    all_confidence: list[float] = []
+    all_stds: list[float] = []
+    all_true: list[int] = []
+
+    with torch.no_grad():
+        batch_iter = tqdm(
+            test_loader,
+            total=len(test_loader),
+            desc=f"bayesian sigma={sigma:g}",
+            unit="batch",
+            leave=False,
+        )
+        for mri, xls, labels, _ in batch_iter:
+            mri_device = mri.float().to(device)
+            xls_device = xls.float().to(device)
+            labels_device = labels.to(device)
+            noisy = _apply_scaled_noise(mri_device, sigma, intensity_range)
+            draws = []
+            for _ in range(mc_passes):
+                out = model((noisy, xls_device))
+                draws.append(out.detach().cpu().numpy())
+
+            draw_mat = np.stack(draws, axis=0)  # (mc_passes, batch, classes)
+            mean = draw_mat.mean(axis=0)[:, class_index]
+            confidence = mean
+            entropy_uncertainty = predictive_entropy(draw_mat)
+            true = labels_device[:, class_index].detach().cpu().numpy().astype(int)
+
+            all_probs.extend(mean.tolist())
+            all_confidence.extend(np.asarray(confidence, dtype=float).tolist())
+            all_stds.extend(np.asarray(entropy_uncertainty, dtype=float).tolist())
+            all_true.extend(true.tolist())
+
+    return (
+        np.asarray(all_true),
+        np.asarray(all_probs),
+        np.asarray(all_confidence),
+        np.asarray(all_stds),
+    )
+
+
+def run_noise_analysis(
+    config: dict[str, Any],
+    root_dir: Path,
+    backend: str,
+    output_dir: Path,
+    class_index: int,
+    noise_sigmas: list[float],
+    threshold: float,
+    calibration_bins: int,
+    bayesian_mc_passes: int,
+) -> dict[str, Any]:
+    noise_sigmas = _uniform_sigma_schedule(noise_sigmas)
+    test_loader = build_holdout_loader(
+        config=config,
+        root_dir=root_dir,
+        seed=int(config["data"]["seed"]),
+    )
+    # intensity_min, intensity_max, intensity_range = _compute_mri_intensity_range(
+    #     dataset
+    # )
+
+    # Just use a fixed intensity range for noise scaling since all that matters is that it's consistent
+    intensity_range = 10_000.0
+
+    out_plots_dir = plots_dir(output_dir)
+    examples_dir = out_plots_dir / "noise_examples"
+    examples_dir.mkdir(parents=True, exist_ok=True)
+
+    rows: list[dict[str, Any]] = []
+    if backend == "ensemble":
+        models = _load_ensemble_models(config)
+        example_rows: list[tuple[float, torch.Tensor]] = []
+        sigma_iter = tqdm(noise_sigmas, desc="Noise sweep (ensemble)", unit="sigma")
+        for sigma in sigma_iter:
+            sigma_iter.set_postfix_str(f"sigma={sigma:g}")
+            y_true, y_prob, y_confidence, y_std = _infer_with_noise_ensemble(
+                test_loader,
+                models,
+                sigma,
+                intensity_range,
+                class_index=class_index,
+            )
+            perf = performance_at_threshold(y_true, y_prob, threshold)
+            cal, _ = calibration_stats(y_true, y_prob, bins=calibration_bins)
+            rows.append(
+                {
+                    "uncertainty_metric": "std",
+                    "noise_factor": float(sigma),
+                    "accuracy": float(perf["accuracy"]),
+                    "f1": float(perf["f1"]),
+                    "mce": float(cal["mce"]),
+                    "mean_confidence": float(np.nanmean(y_confidence)),
+                    "mean_model_output_probability": float(np.nanmean(y_prob)),
+                    "mean_std": float(np.nanmean(y_std)),
+                    "mean_predictive_entropy": float("nan"),
+                    "mri_intensity_range": float(intensity_range),
+                }
+            )
+
+        with torch.no_grad():
+            sample = next(iter(test_loader))
+            original_mri = sample[0]
+            device = next(models[0].parameters()).device
+            original_device = original_mri.float().to(device)
+            for sigma in noise_sigmas:
+                noisy_mri = _apply_scaled_noise(original_device, sigma, intensity_range)
+                example_rows.append((float(sigma), noisy_mri.detach().cpu()))
+
+        save_noise_example_grid(
+            original_mri=original_mri,
+            noisy_by_sigma=example_rows,
+            output_path=examples_dir / f"{backend}_noise_examples.png",
+            title=f"{backend.title()} Noise Examples",
+            max_images=9,
+            n_rows=2,
+        )
+        save_clean_scan_image(
+            original_mri=original_mri,
+            output_path=examples_dir / f"{backend}_clean_scan_example.png",
+        )
+    elif backend == "bayesian":
+        model = _load_bayesian_model(config)
+        example_rows = []
+        sigma_iter = tqdm(noise_sigmas, desc="Noise sweep (bayesian)", unit="sigma")
+        for sigma in sigma_iter:
+            sigma_iter.set_postfix_str(f"sigma={sigma:g}")
+            y_true, y_prob, y_confidence, y_std = _infer_with_noise_bayesian(
+                test_loader,
+                model,
+                sigma,
+                intensity_range,
+                class_index=class_index,
+                mc_passes=bayesian_mc_passes,
+            )
+            perf = performance_at_threshold(y_true, y_prob, threshold)
+            cal, _ = calibration_stats(y_true, y_prob, bins=calibration_bins)
+            rows.append(
+                {
+                    "uncertainty_metric": "predictive_entropy",
+                    "noise_factor": float(sigma),
+                    "accuracy": float(perf["accuracy"]),
+                    "f1": float(perf["f1"]),
+                    "mce": float(cal["mce"]),
+                    "mean_confidence": float(np.nanmean(y_confidence)),
+                    "mean_model_output_probability": float(np.nanmean(y_prob)),
+                    # Compatibility field name retained for downstream code.
+                    "mean_std": float(np.nanmean(y_std)),
+                    "mean_predictive_entropy": float(np.nanmean(y_std)),
+                    "mri_intensity_range": float(intensity_range),
+                }
+            )
+
+        with torch.no_grad():
+            sample = next(iter(test_loader))
+            original_mri = sample[0]
+            device = next(model.parameters()).device
+            original_device = original_mri.float().to(device)
+            for sigma in noise_sigmas:
+                noisy_mri = _apply_scaled_noise(original_device, sigma, intensity_range)
+                example_rows.append((float(sigma), noisy_mri.detach().cpu()))
+
+        save_noise_example_grid(
+            original_mri=original_mri,
+            noisy_by_sigma=example_rows,
+            output_path=examples_dir / f"{backend}_noise_examples.png",
+            title=f"{backend.title()} Noise Examples",
+            max_images=9,
+            n_rows=2,
+        )
+        save_clean_scan_image(
+            original_mri=original_mri,
+            output_path=examples_dir / f"{backend}_clean_scan_example.png",
+        )
+    else:
+        raise ValueError(f"Unsupported backend for noise analysis: {backend}")
+
+    df = pd.DataFrame(rows).sort_values("noise_factor")
+    csv_path = output_dir / "noise_sensitivity.csv"
+    df.to_csv(csv_path, index=False)
+
+    accuracy_plot_path = out_plots_dir / "noise_sensitivity_accuracy.png"
+    f1_plot_path = out_plots_dir / "noise_sensitivity_f1.png"
+    pair_plot_path = out_plots_dir / "noise_sensitivity_accuracy_f1.png"
+    confidence_plot_path = out_plots_dir / "noise_confidence.png"
+    confidence_uncertainty_pair_path = (
+        out_plots_dir / "noise_confidence_predictive_uncertainty.png"
+        if backend == "bayesian"
+        else out_plots_dir / "noise_confidence_standard_deviation.png"
+    )
+    secondary_plot_name = (
+        "noise_predictive_uncertainty.png"
+        if backend == "bayesian"
+        else "noise_standard_deviation.png"
+    )
+    secondary_plot_path = out_plots_dir / secondary_plot_name
+
+    save_noise_metrics_plot(
+        x=df["noise_factor"],
+        y=df["accuracy"],
+        legend_label="Accuracy",
+        marker="o",
+        x_label="Gaussian Noise Factor",
+        y_label="Accuracy",
+        title=f"Accuracy vs Noise ({backend})",
+        output_path=accuracy_plot_path,
+        plot_key="noise_sensitivity_accuracy",
+    )
+    save_noise_metrics_plot(
+        x=df["noise_factor"],
+        y=df["f1"],
+        legend_label="F1",
+        marker="s",
+        x_label="Gaussian Noise Factor",
+        y_label="F1",
+        title=f"F1 vs Noise ({backend})",
+        output_path=f1_plot_path,
+        plot_key="noise_sensitivity_f1",
+    )
+    save_metric_pair_plot(
+        x=df["noise_factor"],
+        left_y=df["accuracy"],
+        right_y=df["f1"],
+        left_label="Accuracy",
+        right_label="F1",
+        x_label="Gaussian Noise Factor",
+        y_label="Accuracy/F1",
+        title=f"Accuracy and F1 vs Noise ({backend})",
+        output_path=pair_plot_path,
+        plot_key="noise_sensitivity_accuracy_f1",
+    )
+
+    secondary_label = (
+        "Predictive Uncertainty" if backend == "bayesian" else "Standard Deviation"
+    )
+
+    save_noise_metrics_plot(
+        x=df["noise_factor"],
+        y=df["mean_confidence"],
+        legend_label="Confidence",
+        marker="o",
+        x_label="Gaussian Noise Factor",
+        y_label="Confidence",
+        title=f"Confidence vs Noise ({backend})",
+        output_path=confidence_plot_path,
+        plot_key="noise_confidence",
+    )
+    save_noise_metrics_plot(
+        x=df["noise_factor"],
+        y=df["mean_std"],
+        legend_label=secondary_label,
+        marker="^",
+        x_label="Gaussian Noise Factor",
+        y_label=secondary_label,
+        title=f"{secondary_label} vs Noise ({backend})",
+        output_path=secondary_plot_path,
+        plot_key=(
+            "noise_predictive_uncertainty"
+            if backend == "bayesian"
+            else "noise_standard_deviation"
+        ),
+    )
+    save_metric_pair_plot(
+        x=df["noise_factor"],
+        left_y=df["mean_confidence"],
+        right_y=df["mean_std"],
+        left_label="Confidence",
+        right_label=secondary_label,
+        x_label="Gaussian Noise Factor",
+        y_label="Confidence/Uncertainty",
+        title=f"Confidence and {secondary_label} vs Noise ({backend})",
+        output_path=confidence_uncertainty_pair_path,
+        plot_key=(
+            "noise_confidence_predictive_uncertainty"
+            if backend == "bayesian"
+            else "noise_confidence_standard_deviation"
+        ),
+    )
+
+    out = {
+        "table": str(csv_path),
+        "plots": {
+            "accuracy": str(accuracy_plot_path),
+            "f1": str(f1_plot_path),
+            "accuracy_f1": str(pair_plot_path),
+            "confidence": str(confidence_plot_path),
+            (
+                "confidence_predictive_uncertainty"
+                if backend == "bayesian"
+                else "confidence_standard_deviation"
+            ): str(confidence_uncertainty_pair_path),
+            (
+                "predictive_uncertainty"
+                if backend == "bayesian"
+                else "standard_deviation"
+            ): str(secondary_plot_path),
+        },
+        "noise_factors": noise_sigmas,
+        "noise_sigmas": noise_sigmas,
+        # "mri_intensity_min": float(intensity_min),
+        # "mri_intensity_max": float(intensity_max),
+        "mri_intensity_range": float(intensity_range),
+    }
+    write_json(output_dir / "noise_summary.json", out)
+    return out

+ 172 - 0
analysis/noise_correlation.py

@@ -0,0 +1,172 @@
+# pyright: basic
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+from scipy import stats
+
+from .plotting import annotate_stats_box, plots_dir
+from .runtime import write_json
+
+
+def _fit_line(metric: np.ndarray, accuracy: np.ndarray) -> dict[str, float]:
+    x = np.asarray(metric, dtype=float)
+    y = np.asarray(accuracy, dtype=float)
+    if len(y) < 3:
+        raise ValueError("Need at least 3 points for linear regression")
+
+    reg = stats.linregress(x, y)
+    return {
+        "intercept": float(reg.intercept),
+        "slope": float(reg.slope),
+        "r_value": float(reg.rvalue),
+        "p_value": float(reg.pvalue),
+        "stderr": float(reg.stderr),
+        "r_squared": float(reg.rvalue**2),
+    }
+
+
+def _metric_specs_for_backend(backend: str, df: pd.DataFrame) -> list[tuple[str, str]]:
+    specs: list[tuple[str, str]] = []
+    if "mean_confidence" in df.columns:
+        specs.append(("mean_confidence", "Confidence"))
+    elif "mean_model_output_probability" in df.columns:
+        # Backward-compatibility fallback for older CSV outputs.
+        specs.append(("mean_model_output_probability", "Confidence"))
+
+    if backend == "bayesian" and "mean_predictive_entropy" in df.columns:
+        specs.append(("mean_predictive_entropy", "Predictive Uncertainty"))
+    elif "mean_std" in df.columns:
+        specs.append(("mean_std", "Standard Deviation"))
+
+    return specs
+
+
+def run_noise_accuracy_uncertainty_analysis(
+    noise_df: pd.DataFrame,
+    backend: str,
+    output_dir: Path,
+) -> dict[str, Any]:
+    required_cols = ["noise_factor", "accuracy"]
+    missing = [c for c in required_cols if c not in noise_df.columns]
+    if missing:
+        raise KeyError(f"Missing required columns in noise dataframe: {missing}")
+
+    metric_specs = _metric_specs_for_backend(backend, noise_df)
+    if not metric_specs:
+        raise ValueError("No uncertainty metrics available for correlation analysis")
+
+    plot_path = plots_dir(output_dir) / "noise_accuracy_uncertainty_2d.png"
+    stats_rows: list[dict[str, Any]] = []
+
+    fig, axes = plt.subplots(1, len(metric_specs), figsize=(8 * len(metric_specs), 6))
+    if len(metric_specs) == 1:
+        axes = np.asarray([axes])
+
+    for idx, (metric_col, metric_label) in enumerate(metric_specs, start=1):
+        if metric_col not in noise_df.columns:
+            continue
+
+        df = noise_df[["noise_factor", metric_col, "accuracy"]].copy()
+        df = df.replace([np.inf, -np.inf], np.nan).dropna()
+        if len(df) < 4:
+            continue
+
+        noise = np.asarray(df["noise_factor"], dtype=float)
+        metric = np.asarray(df[metric_col], dtype=float)
+        accuracy = np.asarray(df["accuracy"], dtype=float)
+
+        pearson = stats.pearsonr(metric, accuracy)
+        fit = _fit_line(metric=metric, accuracy=accuracy)
+
+        stats_rows.append(
+            {
+                "backend": backend,
+                "metric_column": metric_col,
+                "metric_label": metric_label,
+                "n_points": int(len(df)),
+                "pearson_r_metric_vs_accuracy": float(pearson.statistic),
+                "p_value_metric_vs_accuracy": float(pearson.pvalue),
+                "regression_intercept": float(fit["intercept"]),
+                "regression_slope": float(fit["slope"]),
+                "regression_slope_stderr": float(fit["stderr"]),
+                "regression_r_squared": float(fit["r_squared"]),
+            }
+        )
+
+        ax = axes[idx - 1]
+        scatter = ax.scatter(
+            metric,
+            accuracy,
+            c=noise,
+            cmap="viridis",
+            s=42,
+            edgecolors="none",
+        )
+
+        x_line = np.linspace(float(np.min(metric)), float(np.max(metric)), num=200)
+        y_line = fit["intercept"] + fit["slope"] * x_line
+        ax.plot(x_line, y_line, color="#1f77b4", linewidth=2.0, label="Linear fit")
+
+        ax.set_xlabel(metric_label)
+        ax.set_ylabel("Accuracy")
+        ax.set_title(f"{backend.title()} - {metric_label} vs Accuracy")
+        ax.grid(True, alpha=0.3)
+        ax.legend()
+        annotate_stats_box(
+            ax,
+            lines=[
+                f"Pearson r = {pearson.statistic:.3f}",
+                f"p-value = {pearson.pvalue:.3g}",
+                f"R^2 = {fit['r_squared']:.3f}",
+            ],
+            location="upper left",
+        )
+        cbar = fig.colorbar(scatter, ax=ax)
+        cbar.set_label("Noise Factor")
+
+    fig.tight_layout()
+    plot_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(plot_path)
+    plt.close(fig)
+
+    stats_df = pd.DataFrame(stats_rows)
+    stats_csv = output_dir / "noise_accuracy_uncertainty_stats.csv"
+    stats_df.to_csv(stats_csv, index=False)
+
+    summary_md = output_dir / "noise_accuracy_uncertainty_summary.md"
+    lines = [
+        f"# Noise Accuracy-Uncertainty Analysis ({backend})",
+        "",
+        "This analysis collapses the noise axis and fits a 2D linear relationship between uncertainty metric and accuracy.",
+        "Noise factor is encoded as point color in the plot.",
+        "",
+        "## Metrics",
+    ]
+    if stats_df.empty:
+        lines.append("- No valid metric rows were available for regression.")
+    else:
+        for _, row in stats_df.iterrows():
+            lines.extend(
+                [
+                    f"- {row['metric_label']}: Pearson r={row['pearson_r_metric_vs_accuracy']:.4f}, p={row['p_value_metric_vs_accuracy']:.4g}",
+                    f"  - Regression R^2: {row['regression_r_squared']:.4f}",
+                ]
+            )
+
+    summary_md.write_text("\n".join(lines), encoding="utf-8")
+
+    payload = {
+        "backend": backend,
+        "plot": str(plot_path),
+        "table": str(stats_csv),
+        "summary_markdown": str(summary_md),
+        "rows": int(len(stats_df)),
+    }
+    write_json(output_dir / "noise_accuracy_uncertainty_summary.json", payload)
+    return payload

+ 612 - 0
analysis/plotting.py

@@ -0,0 +1,612 @@
+# pyright: basic
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+import torch
+from matplotlib.axes import Axes
+
+# Easily editable plot text overrides by plot key.
+# Example:
+# "performance_threshold": {
+#     "title": "Custom Title",
+#     "x_label": "Custom X",
+#     "y_label": "Custom Y",
+# }
+# Common keys:
+# - performance_threshold_accuracy
+# - performance_threshold_f1
+# - performance_uncertainty_cutoff_accuracy
+# - performance_uncertainty_cutoff_f1
+# - performance_uncertainty_percentile_cutoff_accuracy
+# - performance_uncertainty_percentile_cutoff_f1
+# - calibration_reliability
+# - noise_sensitivity_accuracy
+# - noise_sensitivity_f1
+# - noise_confidence
+# - noise_standard_deviation
+# - noise_predictive_uncertainty
+# - boxplot
+PLOT_TEXT_OVERRIDES: dict[str, dict[str, str]] = {}
+
+
+def _resolve_plot_text(
+    plot_key: str,
+    default_title: str,
+    default_x_label: str,
+    default_y_label: str,
+) -> tuple[str, str, str]:
+    override = PLOT_TEXT_OVERRIDES.get(plot_key, {})
+    return (
+        override.get("title", default_title),
+        override.get("x_label", default_x_label),
+        override.get("y_label", default_y_label),
+    )
+
+
+def annotate_stats_box(
+    ax: Axes,
+    lines: list[str],
+    location: str = "upper left",
+) -> None:
+    if not lines:
+        return
+
+    locations: dict[str, tuple[float, float, str, str]] = {
+        "upper left": (0.02, 0.98, "left", "top"),
+        "upper right": (0.98, 0.98, "right", "top"),
+        "lower left": (0.02, 0.02, "left", "bottom"),
+        "lower right": (0.98, 0.02, "right", "bottom"),
+    }
+    x, y, ha, va = locations.get(location, locations["upper left"])
+    ax.text(
+        x,
+        y,
+        "\n".join(lines),
+        transform=ax.transAxes,
+        ha=ha,
+        va=va,
+        fontsize=10,
+        bbox={
+            "boxstyle": "round,pad=0.35",
+            "facecolor": "white",
+            "edgecolor": "#555555",
+            "alpha": 0.9,
+        },
+    )
+
+
+def _plot_correct_incorrect_bars(
+    ax: Axes,
+    x_values: pd.Series,
+    n_correct: pd.Series,
+    n_incorrect: pd.Series,
+) -> None:
+    x = np.asarray(x_values, dtype=float)
+    correct = np.asarray(n_correct, dtype=float)
+    incorrect = np.asarray(n_incorrect, dtype=float)
+    if x.size == 0 or correct.size == 0 or incorrect.size == 0:
+        return
+
+    width = float(np.diff(np.sort(x)).min()) * 0.8 if x.size > 1 else 0.04
+    max_count = float(max(np.nanmax(correct), np.nanmax(incorrect), 1.0))
+
+    bars_ax = ax.twinx()
+    bars_ax.patch.set_alpha(0.0)
+    bars_ax.bar(
+        x,
+        correct,
+        width=width,
+        color="#2ca02c",
+        alpha=0.2,
+        label="correct",
+        zorder=0,
+        align="center",
+    )
+    bars_ax.bar(
+        x,
+        -incorrect,
+        width=width,
+        color="#d62728",
+        alpha=0.2,
+        label="incorrect",
+        zorder=0,
+        align="center",
+    )
+    bars_ax.axhline(0.0, color="gray", linewidth=0.8, alpha=0.4)
+    bars_ax.set_ylim(-1.15 * max_count, 1.15 * max_count)
+    bars_ax.set_yticks([])
+    bars_ax.grid(False)
+
+
+def save_coverage_bar_plot(
+    x_values: pd.Series | np.ndarray,
+    n_correct: pd.Series | np.ndarray,
+    n_incorrect: pd.Series | np.ndarray,
+    x_label: str,
+    title: str,
+    output_path: Path,
+) -> None:
+    """Save a standalone bar chart showing sample counts (correct vs incorrect)."""
+    x = np.asarray(x_values, dtype=float)
+    correct = np.asarray(n_correct, dtype=float)
+    incorrect = np.asarray(n_incorrect, dtype=float)
+    if x.size == 0 or correct.size == 0 or incorrect.size == 0:
+        return
+
+    width = float(np.diff(np.sort(x)).min()) * 0.8 if x.size > 1 else 0.04
+    max_count = float(max(np.nanmax(correct), np.nanmax(incorrect), 1.0))
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    ax.bar(
+        x,
+        correct,
+        width=width,
+        color="#2ca02c",
+        alpha=0.6,
+        label="correct",
+        align="center",
+    )
+    ax.bar(
+        x,
+        -incorrect,
+        width=width,
+        color="#d62728",
+        alpha=0.6,
+        label="incorrect",
+        align="center",
+    )
+    ax.axhline(0.0, color="gray", linewidth=0.8, alpha=0.4)
+    ax.set_ylim(-1.15 * max_count, 1.15 * max_count)
+    ax.set_xlabel(x_label)
+    ax.set_ylabel("Sample Count")
+    ax.set_title(title)
+    ax.legend()
+    ax.grid(True, alpha=0.3)
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+
+def plots_dir(output_dir: Path) -> Path:
+    plots = output_dir / "plots"
+    plots.mkdir(parents=True, exist_ok=True)
+    return plots
+
+
+def save_performance_threshold_plot(
+    df: pd.DataFrame,
+    backend: str,
+    output_path: Path,
+    metric_column: str,
+    metric_label: str,
+    plot_key: str,
+) -> None:
+    title, x_label, y_label = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=f"{metric_label} vs Decision Threshold ({backend})",
+        default_x_label="Decision Threshold",
+        default_y_label=metric_label,
+    )
+
+    n_correct = pd.to_numeric(df["tp"], errors="coerce") + pd.to_numeric(
+        df["tn"], errors="coerce"
+    )
+    n_incorrect = pd.to_numeric(df["fp"], errors="coerce") + pd.to_numeric(
+        df["fn"], errors="coerce"
+    )
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    ax.plot(df["threshold"], df[metric_column], label=metric_label, marker="o")
+    ax.set_xlabel(x_label)
+    ax.set_ylabel(y_label)
+    ax.set_title(title)
+    ax.grid(True, alpha=0.3)
+    ax.legend()
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+    # Generate separate coverage bar plot
+    coverage_path = output_path.parent / f"{output_path.stem}_coverage.png"
+    save_coverage_bar_plot(
+        x_values=df["threshold"],
+        n_correct=n_correct,
+        n_incorrect=n_incorrect,
+        x_label=x_label,
+        title=f"Sample Distribution vs Decision Threshold ({backend})",
+        output_path=coverage_path,
+    )
+
+
+def save_performance_threshold_pair_plot(
+    df: pd.DataFrame,
+    backend: str,
+    output_path: Path,
+    plot_key: str,
+) -> None:
+    title, x_label, _ = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=f"Accuracy and F1 vs Decision Threshold ({backend})",
+        default_x_label="Decision Threshold",
+        default_y_label="Accuracy/F1",
+    )
+
+    n_correct = pd.to_numeric(df["tp"], errors="coerce") + pd.to_numeric(
+        df["tn"], errors="coerce"
+    )
+    n_incorrect = pd.to_numeric(df["fp"], errors="coerce") + pd.to_numeric(
+        df["fn"], errors="coerce"
+    )
+
+    fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)
+    for ax, metric_col, metric_label, marker in [
+        (axes[0], "accuracy", "Accuracy", "o"),
+        (axes[1], "f1", "F1", "s"),
+    ]:
+        ax.plot(df["threshold"], df[metric_col], label=metric_label, marker=marker)
+        ax.set_xlabel(x_label)
+        ax.set_ylabel(metric_label)
+        ax.set_title(f"{metric_label}")
+        ax.grid(True, alpha=0.3)
+        ax.legend()
+
+    fig.suptitle(title)
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+    # Generate separate coverage bar plot
+    coverage_path = output_path.parent / f"{output_path.stem}_coverage.png"
+    save_coverage_bar_plot(
+        x_values=df["threshold"],
+        n_correct=n_correct,
+        n_incorrect=n_incorrect,
+        x_label=x_label,
+        title=f"Sample Distribution vs Decision Threshold ({backend})",
+        output_path=coverage_path,
+    )
+
+
+def save_uncertainty_cutoff_plot(
+    cutoff_df: pd.DataFrame,
+    title_prefix: str,
+    x_label: str,
+    output_path: Path,
+    metric_column: str,
+    metric_label: str,
+    plot_key: str,
+) -> None:
+    title, x_label_final, y_label = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=f"{metric_label} vs {title_prefix}",
+        default_x_label=x_label,
+        default_y_label=metric_label,
+    )
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+
+    for uncertainty_name, group in cutoff_df.groupby("uncertainty_type"):
+        g = group.sort_values("restriction_level")
+        ax.plot(
+            g["restriction_level"],
+            g[metric_column],
+            marker="o",
+            label=uncertainty_name,
+        )
+
+    ax.set_title(title)
+    ax.set_xlabel(x_label_final)
+    ax.set_ylabel(y_label)
+    ax.grid(True, alpha=0.3)
+    ax.legend()
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+    # Generate separate coverage bar plot
+    first_group = (
+        cutoff_df.sort_values(["uncertainty_type", "restriction_level"])
+        .groupby("uncertainty_type", as_index=False)
+        .head(1)
+    )
+    if not first_group.empty:
+        rep_name = str(first_group.iloc[0]["uncertainty_type"])
+        rep = cutoff_df[cutoff_df["uncertainty_type"] == rep_name].sort_values(
+            "restriction_level"
+        )
+        coverage_path = output_path.parent / f"{output_path.stem}_coverage.png"
+        save_coverage_bar_plot(
+            x_values=rep["restriction_level"],
+            n_correct=pd.to_numeric(rep["n_correct"], errors="coerce"),
+            n_incorrect=pd.to_numeric(rep["n_incorrect"], errors="coerce"),
+            x_label=x_label_final,
+            title=f"Sample Coverage vs {title_prefix}",
+            output_path=coverage_path,
+        )
+
+
+def save_uncertainty_cutoff_pair_plot(
+    cutoff_df: pd.DataFrame,
+    title_prefix: str,
+    x_label: str,
+    output_path: Path,
+    plot_key: str,
+) -> None:
+    title, x_label_final, _ = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=f"Accuracy and F1 vs {title_prefix}",
+        default_x_label=x_label,
+        default_y_label="Accuracy/F1",
+    )
+
+    fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)
+
+    for uncertainty_name, group in cutoff_df.groupby("uncertainty_type"):
+        g = group.sort_values("restriction_level")
+        axes[0].plot(
+            g["restriction_level"], g["accuracy"], marker="o", label=uncertainty_name
+        )
+        axes[1].plot(
+            g["restriction_level"], g["f1"], marker="s", label=uncertainty_name
+        )
+
+    axes[0].set_title("Accuracy")
+    axes[1].set_title("F1")
+    for ax, metric_label in [(axes[0], "Accuracy"), (axes[1], "F1")]:
+        ax.set_xlabel(x_label_final)
+        ax.set_ylabel(metric_label)
+        ax.grid(True, alpha=0.3)
+        ax.legend()
+
+    fig.suptitle(title)
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+    # Generate separate coverage bar plot
+    first_group = (
+        cutoff_df.sort_values(["uncertainty_type", "restriction_level"])
+        .groupby("uncertainty_type", as_index=False)
+        .head(1)
+    )
+    if not first_group.empty:
+        rep_name = str(first_group.iloc[0]["uncertainty_type"])
+        rep = cutoff_df[cutoff_df["uncertainty_type"] == rep_name].sort_values(
+            "restriction_level"
+        )
+        coverage_path = output_path.parent / f"{output_path.stem}_coverage.png"
+        save_coverage_bar_plot(
+            x_values=rep["restriction_level"],
+            n_correct=pd.to_numeric(rep["n_correct"], errors="coerce"),
+            n_incorrect=pd.to_numeric(rep["n_incorrect"], errors="coerce"),
+            x_label=x_label_final,
+            title=f"Sample Coverage vs {title_prefix}",
+            output_path=coverage_path,
+        )
+
+
+def save_calibration_plot(per_bin: np.ndarray, backend: str, output_path: Path) -> None:
+    title, x_label, y_label = _resolve_plot_text(
+        plot_key="calibration_reliability",
+        default_title=f"Reliability Diagram ({backend})",
+        default_x_label="Mean Predicted Probability",
+        default_y_label="Empirical Fraction Positive",
+    )
+
+    fig, ax = plt.subplots(figsize=(6, 6))
+    valid = ~np.isnan(per_bin[:, 1])
+    ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="ideal")
+    ax.plot(per_bin[valid, 0], per_bin[valid, 1], marker="o", label=backend)
+    ax.set_xlabel(x_label)
+    ax.set_ylabel(y_label)
+    ax.set_title(title)
+    ax.legend()
+    ax.grid(True, alpha=0.3)
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+
+def save_boxplot(
+    data: list[np.ndarray],
+    tick_labels: list[str],
+    x_label: str,
+    y_label: str,
+    title: str,
+    output_path: Path,
+) -> None:
+    title_final, x_label_final, y_label_final = _resolve_plot_text(
+        plot_key="boxplot",
+        default_title=title,
+        default_x_label=x_label,
+        default_y_label=y_label,
+    )
+
+    labels_with_n = [
+        f"{label}\n(n={len(np.asarray(values, dtype=float))})"
+        for label, values in zip(tick_labels, data)
+    ]
+
+    fig, ax = plt.subplots(figsize=(9, 5))
+    ax.boxplot(data, tick_labels=labels_with_n)
+    ax.set_xlabel(x_label_final)
+    ax.set_ylabel(y_label_final)
+    ax.set_title(title_final)
+    ax.grid(True, axis="y", alpha=0.3)
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+
+def _central_slice(volume: torch.Tensor) -> np.ndarray:
+    tensor = volume.detach().cpu()
+    if tensor.ndim == 5:
+        tensor = tensor[0]
+    if tensor.ndim == 4:
+        tensor = tensor[0]
+    if tensor.ndim != 3:
+        raise ValueError(
+            f"Expected a 3D volume after squeezing batch/channel, got shape {tuple(tensor.shape)}"
+        )
+
+    center_index = tensor.shape[0] // 2
+    return tensor[center_index].numpy().astype(float)
+
+
+def _normalize_for_display(image: np.ndarray) -> np.ndarray:
+    low = float(np.percentile(image, 1))
+    high = float(np.percentile(image, 99))
+    if high <= low:
+        return np.zeros_like(image, dtype=float)
+    clipped = np.clip(image, low, high)
+    return (clipped - low) / (high - low)
+
+
+def save_noise_example_grid(
+    original_mri: torch.Tensor,
+    noisy_by_sigma: list[tuple[float, torch.Tensor]],
+    output_path: Path,
+    title: str,
+    max_images: int = 9,
+    n_rows: int = 2,
+) -> None:
+    if not noisy_by_sigma:
+        return
+
+    n_total = len(noisy_by_sigma)
+    target = max(1, min(int(max_images), n_total))
+    if target >= n_total:
+        selected = noisy_by_sigma
+    else:
+        # Sample indices across the full tested range, including first and last.
+        raw_idx = np.linspace(0, n_total - 1, num=target)
+        idx = np.round(raw_idx).astype(int)
+        selected_indices = sorted(set(idx.tolist()))
+        if len(selected_indices) < target:
+            existing = set(selected_indices)
+            for i in range(n_total):
+                if i in existing:
+                    continue
+                selected_indices.append(i)
+                if len(selected_indices) >= target:
+                    break
+            selected_indices = sorted(selected_indices)
+
+        selected = [noisy_by_sigma[i] for i in selected_indices]
+
+    n_images = len(selected)
+    n_rows = max(1, int(n_rows))
+    n_cols = int(np.ceil(n_images / n_rows))
+
+    fig, axes = plt.subplots(n_rows, n_cols, figsize=(3.8 * n_cols, 3.2 * n_rows))
+    axes_flat = np.atleast_1d(axes).reshape(-1)
+
+    for idx, (sigma, noisy_tensor) in enumerate(selected):
+        ax = axes_flat[idx]
+        noisy_slice = _normalize_for_display(_central_slice(noisy_tensor))
+        ax.imshow(noisy_slice, cmap="gray")
+        ax.set_title(f"Noise factor={sigma:g}")
+        ax.axis("off")
+
+    for idx in range(n_images, len(axes_flat)):
+        axes_flat[idx].axis("off")
+
+    fig.suptitle(title)
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+
+def save_clean_scan_image(
+    original_mri: torch.Tensor,
+    output_path: Path,
+) -> None:
+    image = _normalize_for_display(_central_slice(original_mri))
+    fig, ax = plt.subplots(figsize=(4, 4))
+    ax.imshow(image, cmap="gray")
+    ax.axis("off")
+    fig.subplots_adjust(left=0, right=1, top=1, bottom=0)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path, bbox_inches="tight", pad_inches=0)
+    plt.close(fig)
+
+
+def save_noise_metrics_plot(
+    x: pd.Series,
+    y: pd.Series,
+    legend_label: str,
+    marker: str,
+    x_label: str,
+    y_label: str,
+    title: str,
+    output_path: Path,
+    plot_key: str,
+) -> None:
+    title_final, x_label_final, y_label_final = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=title,
+        default_x_label=x_label,
+        default_y_label=y_label,
+    )
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    ax.plot(x, y, marker=marker, label=legend_label)
+    ax.set_xlabel(x_label_final)
+    ax.set_ylabel(y_label_final)
+    ax.set_title(title_final)
+    ax.grid(True, alpha=0.3)
+    ax.legend()
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)
+
+
+def save_metric_pair_plot(
+    x: pd.Series,
+    left_y: pd.Series,
+    right_y: pd.Series,
+    left_label: str,
+    right_label: str,
+    x_label: str,
+    y_label: str,
+    title: str,
+    output_path: Path,
+    plot_key: str,
+) -> None:
+    title_final, x_label_final, y_label_final = _resolve_plot_text(
+        plot_key=plot_key,
+        default_title=title,
+        default_x_label=x_label,
+        default_y_label=y_label,
+    )
+
+    fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)
+    axes[0].plot(x, left_y, marker="o", label=left_label)
+    axes[1].plot(x, right_y, marker="s", label=right_label)
+
+    for ax, name in [(axes[0], left_label), (axes[1], right_label)]:
+        ax.set_xlabel(x_label_final)
+        ax.set_ylabel(name)
+        ax.set_title(name)
+        ax.grid(True, alpha=0.3)
+        ax.legend()
+
+    fig.suptitle(title_final)
+    fig.tight_layout()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path)
+    plt.close(fig)

+ 327 - 0
analysis/regenerate_plots.py

@@ -0,0 +1,327 @@
+# pyright: basic
+
+"""Regenerate analysis plots from existing computed data (CSV files).
+
+This script regenerates all plots from previously computed analysis results
+without re-running the full analysis pipeline. Useful when making changes
+to plotting parameters or fixing visualizations.
+
+Usage: Run from the project root (alnn_rewrite directory):
+    python analysis/regenerate_plots.py /path/to/run_directory/backend_name
+
+Example:
+    python analysis/regenerate_plots.py analysis_output/run_20260428_120000/ensemble
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pandas as pd
+
+# Add parent directory to path for imports
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+from analysis.analysis_modules import _uncertainty_cutoff_analysis
+from analysis.defaults import (
+    DEFAULT_CALIBRATION_BINS,
+    DEFAULT_DECISION_THRESHOLD,
+    uncertainty_cutoff_percentiles,
+)
+from analysis.plotting import (
+    plots_dir,
+    save_calibration_plot,
+    save_performance_threshold_pair_plot,
+    save_performance_threshold_plot,
+    save_uncertainty_cutoff_pair_plot,
+    save_uncertainty_cutoff_plot,
+)
+from analysis.runtime import write_json
+
+
+def _plot_description(filename: str) -> str:
+    descriptions = {
+        "performance_threshold_accuracy.png": "Accuracy as the decision threshold varies.",
+        "performance_threshold_f1.png": "F1 score as the decision threshold varies.",
+        "performance_threshold_accuracy_f1.png": "Accuracy and F1 shown side-by-side as the decision threshold varies.",
+        "performance_uncertainty_cutoff_accuracy.png": "Accuracy while progressively restricting to higher-confidence and uncertainty-metric subsets.",
+        "performance_uncertainty_cutoff_f1.png": "F1 score while progressively restricting to higher-confidence and uncertainty-metric subsets.",
+        "performance_uncertainty_cutoff_accuracy_f1.png": "Accuracy and F1 shown side-by-side across uncertainty-cutoff restriction levels.",
+        "performance_uncertainty_percentile_cutoff_accuracy.png": "Accuracy from least to most restricted percentile-wise subset selection.",
+        "performance_uncertainty_percentile_cutoff_f1.png": "F1 score from least to most restricted percentile-wise subset selection.",
+        "performance_uncertainty_percentile_cutoff_accuracy_f1.png": "Accuracy and F1 shown side-by-side across percentile-floor restriction levels.",
+        "calibration_reliability.png": "Reliability diagram comparing predicted probability to empirical outcome frequency.",
+        "performance_threshold_accuracy_coverage.png": "Sample distribution (correct vs incorrect) across decision thresholds.",
+        "performance_threshold_f1_coverage.png": "Sample distribution (correct vs incorrect) across decision thresholds.",
+        "performance_threshold_accuracy_f1_coverage.png": "Sample distribution (correct vs incorrect) across decision thresholds.",
+        "performance_uncertainty_cutoff_accuracy_coverage.png": "Sample coverage breakdown across restriction levels.",
+        "performance_uncertainty_cutoff_f1_coverage.png": "Sample coverage breakdown across restriction levels.",
+        "performance_uncertainty_cutoff_accuracy_f1_coverage.png": "Sample coverage breakdown across restriction levels.",
+        "performance_uncertainty_percentile_cutoff_accuracy_coverage.png": "Sample coverage breakdown as percentile floor increases.",
+        "performance_uncertainty_percentile_cutoff_f1_coverage.png": "Sample coverage breakdown as percentile floor increases.",
+        "performance_uncertainty_percentile_cutoff_accuracy_f1_coverage.png": "Sample coverage breakdown as percentile floor increases.",
+    }
+    return descriptions.get(filename, "Generated analysis plot.")
+
+
+def _write_backend_plot_report(backend: str, out_dir: Path) -> Path:
+    plots = out_dir / "plots"
+    images = sorted(plots.rglob("*.png")) if plots.exists() else []
+
+    report_path = out_dir / "plots_report.md"
+    lines = [
+        f"# {backend.title()} Analysis Plot Report (Regenerated)",
+        "",
+        "This document lists regenerated analysis plots with brief descriptions.",
+        "",
+    ]
+    if not images:
+        lines.append("No plot images were found for this backend run.")
+    else:
+        for image_path in images:
+            rel = image_path.relative_to(out_dir).as_posix()
+            title = image_path.stem.replace("_", " ").title()
+            lines.append(f"## {title}")
+            lines.append(_plot_description(image_path.name))
+            lines.append("")
+            lines.append(f"![{title}]({rel})")
+            lines.append("")
+
+    report_path.write_text("\n".join(lines), encoding="utf-8")
+    return report_path
+
+
+def regenerate_performance_plots(backend_dir: Path) -> dict[str, Any]:
+    """Regenerate performance threshold plots from existing CSV."""
+    perf_csv = backend_dir / "performance_threshold_sweep.csv"
+    if not perf_csv.exists():
+        return {"status": "skipped", "reason": "no performance_threshold_sweep.csv"}
+
+    df = pd.read_csv(perf_csv)
+    backend = backend_dir.name if backend_dir.name != "plots" else "ensemble"
+
+    # Get backend name from parent directory name if not found
+    if backend_dir.parent.name not in ["ensemble", "bayesian"]:
+        parent_name = backend_dir.name
+        if parent_name in {"ensemble", "bayesian"}:
+            backend = parent_name
+
+    accuracy_plot_path = plots_dir(backend_dir) / "performance_threshold_accuracy.png"
+    f1_plot_path = plots_dir(backend_dir) / "performance_threshold_f1.png"
+    pair_plot_path = plots_dir(backend_dir) / "performance_threshold_accuracy_f1.png"
+
+    save_performance_threshold_plot(
+        df=df,
+        backend=backend,
+        output_path=accuracy_plot_path,
+        metric_column="accuracy",
+        metric_label="Accuracy",
+        plot_key="performance_threshold_accuracy",
+    )
+    save_performance_threshold_plot(
+        df=df,
+        backend=backend,
+        output_path=f1_plot_path,
+        metric_column="f1",
+        metric_label="F1",
+        plot_key="performance_threshold_f1",
+    )
+    save_performance_threshold_pair_plot(
+        df=df,
+        backend=backend,
+        output_path=pair_plot_path,
+        plot_key="performance_threshold_accuracy_f1",
+    )
+
+    return {
+        "status": "regenerated",
+        "performance_threshold_accuracy": str(accuracy_plot_path),
+        "performance_threshold_f1": str(f1_plot_path),
+        "performance_threshold_accuracy_f1": str(pair_plot_path),
+    }
+
+
+def regenerate_uncertainty_cutoff_plots(backend_dir: Path) -> dict[str, Any]:
+    """Regenerate uncertainty cutoff plots from existing CSV."""
+    cutoff_csv = backend_dir / "performance_uncertainty_cutoff.csv"
+    percentile_csv = backend_dir / "performance_uncertainty_percentile_cutoff.csv"
+
+    results = {"status": "skipped", "reason": "no cutoff CSV files found"}
+
+    if cutoff_csv.exists():
+        cutoff_df = pd.read_csv(cutoff_csv)
+        results["status"] = "regenerated"
+
+        # Create plots by uncertainty type
+        for uncertainty_name in sorted(pd.unique(cutoff_df["uncertainty_type"])):
+            sub_df = cutoff_df[cutoff_df["uncertainty_type"] == uncertainty_name].copy()
+            slug = uncertainty_name.lower().replace(" ", "_")
+
+            sub_accuracy_plot_path = (
+                plots_dir(backend_dir)
+                / f"performance_uncertainty_cutoff_{slug}_accuracy.png"
+            )
+            sub_f1_plot_path = (
+                plots_dir(backend_dir) / f"performance_uncertainty_cutoff_{slug}_f1.png"
+            )
+            sub_pair_plot_path = (
+                plots_dir(backend_dir)
+                / f"performance_uncertainty_cutoff_{slug}_accuracy_f1.png"
+            )
+
+            save_uncertainty_cutoff_plot(
+                cutoff_df=sub_df,
+                title_prefix="Model Output / Uncertainty Cutoff Percentile",
+                x_label="Restriction Level (0 = all samples, 100 = most restricted subset)",
+                output_path=sub_accuracy_plot_path,
+                metric_column="accuracy",
+                metric_label="Accuracy",
+                plot_key="performance_uncertainty_cutoff_accuracy",
+            )
+            save_uncertainty_cutoff_plot(
+                cutoff_df=sub_df,
+                title_prefix="Model Output / Uncertainty Cutoff Percentile",
+                x_label="Restriction Level (0 = all samples, 100 = most restricted subset)",
+                output_path=sub_f1_plot_path,
+                metric_column="f1",
+                metric_label="F1",
+                plot_key="performance_uncertainty_cutoff_f1",
+            )
+            save_uncertainty_cutoff_pair_plot(
+                cutoff_df=sub_df,
+                title_prefix="Model Output / Uncertainty Cutoff Percentile",
+                x_label="Restriction Level (0 = all samples, 100 = most restricted subset)",
+                output_path=sub_pair_plot_path,
+                plot_key="performance_uncertainty_cutoff_accuracy_f1",
+            )
+
+    if percentile_csv.exists():
+        percentile_df = pd.read_csv(percentile_csv)
+        results["status"] = "regenerated"
+
+        # Create plots by uncertainty type
+        for uncertainty_name in sorted(pd.unique(percentile_df["uncertainty_type"])):
+            sub_df = percentile_df[
+                percentile_df["uncertainty_type"] == uncertainty_name
+            ].copy()
+            slug = uncertainty_name.lower().replace(" ", "_")
+
+            sub_accuracy_plot_path = (
+                plots_dir(backend_dir)
+                / f"performance_uncertainty_percentile_cutoff_{slug}_accuracy.png"
+            )
+            sub_f1_plot_path = (
+                plots_dir(backend_dir)
+                / f"performance_uncertainty_percentile_cutoff_{slug}_f1.png"
+            )
+            sub_pair_plot_path = (
+                plots_dir(backend_dir)
+                / f"performance_uncertainty_percentile_cutoff_{slug}_accuracy_f1.png"
+            )
+
+            save_uncertainty_cutoff_plot(
+                cutoff_df=sub_df,
+                title_prefix="Model Output / Uncertainty Percentile Floor",
+                x_label="Percentile Floor (0 = all samples, 100 = top percentile subset)",
+                output_path=sub_accuracy_plot_path,
+                metric_column="accuracy",
+                metric_label="Accuracy",
+                plot_key="performance_uncertainty_percentile_cutoff_accuracy",
+            )
+            save_uncertainty_cutoff_plot(
+                cutoff_df=sub_df,
+                title_prefix="Model Output / Uncertainty Percentile Floor",
+                x_label="Percentile Floor (0 = all samples, 100 = top percentile subset)",
+                output_path=sub_f1_plot_path,
+                metric_column="f1",
+                metric_label="F1",
+                plot_key="performance_uncertainty_percentile_cutoff_f1",
+            )
+            save_uncertainty_cutoff_pair_plot(
+                cutoff_df=sub_df,
+                title_prefix="Model Output / Uncertainty Percentile Floor",
+                x_label="Percentile Floor (0 = all samples, 100 = top percentile subset)",
+                output_path=sub_pair_plot_path,
+                plot_key="performance_uncertainty_percentile_cutoff_accuracy_f1",
+            )
+
+    return results
+
+
+def regenerate_calibration_plots(backend_dir: Path) -> dict[str, Any]:
+    """Regenerate calibration plots from existing calibration data."""
+    calib_path = backend_dir / "calibration_per_bin.npy"
+    if not calib_path.exists():
+        return {"status": "skipped", "reason": "no calibration_per_bin.npy"}
+
+    per_bin = np.load(calib_path)
+    backend = backend_dir.name if backend_dir.name != "plots" else "ensemble"
+
+    # Get backend name from parent directory name if not found
+    if backend_dir.parent.name not in ["ensemble", "bayesian"]:
+        parent_name = backend_dir.name
+        if parent_name in {"ensemble", "bayesian"}:
+            backend = parent_name
+
+    plot_path = plots_dir(backend_dir) / "calibration_reliability.png"
+    save_calibration_plot(per_bin=per_bin, backend=backend, output_path=plot_path)
+
+    return {
+        "status": "regenerated",
+        "calibration_reliability": str(plot_path),
+    }
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(
+        description="Regenerate analysis plots from existing computed data CSV files."
+    )
+    parser.add_argument(
+        "backend_dir",
+        type=Path,
+        help="Path to backend-specific analysis output directory "
+        "(e.g., analysis_output/run_xxx/ensemble)",
+    )
+
+    args = parser.parse_args()
+    backend_dir = args.backend_dir.resolve()
+
+    if not backend_dir.exists():
+        print(
+            f"Error: Backend directory does not exist: {backend_dir}", file=sys.stderr
+        )
+        sys.exit(1)
+
+    print(f"Regenerating plots from: {backend_dir}")
+
+    results: dict[str, Any] = {
+        "backend_dir": str(backend_dir),
+        "performance": regenerate_performance_plots(backend_dir),
+        "uncertainty_cutoff": regenerate_uncertainty_cutoff_plots(backend_dir),
+        "calibration": regenerate_calibration_plots(backend_dir),
+    }
+
+    # Write updated report
+    report_path = _write_backend_plot_report(
+        backend=backend_dir.name, out_dir=backend_dir
+    )
+    results["plots_report"] = str(report_path)
+
+    print(f"\nPlot regeneration complete!")
+    print(f"Results summary:")
+    print(f"  Performance plots: {results['performance'].get('status', 'unknown')}")
+    print(
+        f"  Uncertainty cutoff plots: {results['uncertainty_cutoff'].get('status', 'unknown')}"
+    )
+    print(f"  Calibration plots: {results['calibration'].get('status', 'unknown')}")
+    print(f"  Report written to: {report_path}")
+
+    write_json(backend_dir / "plot_regeneration_log.json", results)
+
+
+if __name__ == "__main__":
+    main()

+ 51 - 0
analysis/runtime.py

@@ -0,0 +1,51 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+from datetime import datetime
+import json
+from pathlib import Path
+import tomllib
+from typing import Any
+
+
+@dataclass(frozen=True)
+class RuntimePaths:
+    root_dir: Path
+    analysis_dir: Path
+    output_root: Path
+    run_dir: Path
+
+
+def load_config(root_dir: Path) -> dict[str, Any]:
+    config_path = root_dir / "config.toml"
+    with config_path.open("rb") as f:
+        return tomllib.load(f)
+
+
+def init_runtime_paths(analysis_dir: Path, run_name: str | None = None) -> RuntimePaths:
+    root_dir = analysis_dir.parent
+    output_root = root_dir / "analysis_output"
+    output_root.mkdir(parents=True, exist_ok=True)
+
+    safe_run_name = run_name or datetime.now().strftime("run_%Y%m%d_%H%M%S")
+    run_dir = output_root / safe_run_name
+    run_dir.mkdir(parents=True, exist_ok=True)
+
+    return RuntimePaths(
+        root_dir=root_dir,
+        analysis_dir=analysis_dir,
+        output_root=output_root,
+        run_dir=run_dir,
+    )
+
+
+def backend_dir(paths: RuntimePaths, backend: str) -> Path:
+    out = paths.run_dir / backend
+    out.mkdir(parents=True, exist_ok=True)
+    return out
+
+
+def write_json(path: Path, payload: dict[str, Any]) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w", encoding="utf-8") as f:
+        json.dump(payload, f, indent=2)

+ 16 - 0
analysis/uncertainty.py

@@ -0,0 +1,16 @@
+# pyright: basic
+
+from __future__ import annotations
+
+import numpy as np
+
+
+def confidence_certainty(y_prob: np.ndarray) -> np.ndarray:
+    # 2 * |p - 0.5| maps 0 -> very uncertain and 1 -> very certain.
+    return 2.0 * np.abs(np.asarray(y_prob, dtype=float) - 0.5)
+
+
+def confidence_uncertainty(y_prob: np.ndarray) -> np.ndarray:
+    # Normalized uncertainty in [0, 1]: 1 at p=0.5 and 0 at p in {0, 1}.
+    uncertainty = 1.0 - confidence_certainty(np.asarray(y_prob, dtype=float))
+    return np.clip(uncertainty, 0.0, 1.0)

+ 1 - 1
run_overnight_training.py

@@ -111,7 +111,7 @@ def main() -> int:
     log_dir = (
         pl.Path(args.log_dir).resolve()
         if args.log_dir is not None
-        else workdir / "logs"
+        else                 / "logs"
     )
     run_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
     run_log_dir = log_dir / f"overnight_{run_stamp}"