# Model Evaluation Code

## Description

This folder should contain all the code needed to evaluate the Bayesian model and the Deep Ensemble. It should generate and save graphs and statistics for both. The analyses should include:

- Performance information (i.e. basic information on accuracy, number correct, number incorrect, F1 score, etc.)
- Some basic metrics for uncertainty (ECE, MCE)
- Physician Confidence Analysis (graphing uncertainty vs physician confidence)
- Longitudinal Analysis (graphing uncertainty on patients who remained stable CN, stable AD, or switched from CN to AD)
- Noise introduction analysis (graphing uncertainty on the original images and on images with increasing levels of Gaussian noise applied)
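
As a rough illustration of the calibration metrics named in the list above, the following sketch computes ECE and MCE for binary positive-class probabilities using equal-width confidence bins. The function name `calibration_errors` and its signature are hypothetical and are not taken from `metrics.py`; the binning scheme is an assumption.

```python
import numpy as np


def calibration_errors(probs, labels, n_bins=10):
    """Return (ECE, MCE) for binary positive-class probabilities.

    Hypothetical helper: assumes a 0.5 decision threshold and
    equal-width bins over the predicted-class confidence.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    # Confidence in the predicted class, which lies in [0.5, 1].
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    # Interior bin edges; digitize maps each sample to a bin in 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(conf, edges[1:-1])

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # Gap between empirical accuracy and mean confidence in this bin.
        gap = abs(correct[mask].mean() - conf[mask].mean())
        ece += mask.mean() * gap  # weighted by the fraction of samples in the bin
        mce = max(mce, gap)
    return ece, mce
```

ECE averages the per-bin gap weighted by bin population, while MCE reports the single worst bin, so the two can diverge sharply when one sparsely populated bin is badly calibrated.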

"Uncertainty" should be taken to mean EITHER confidence or standard deviation, for both the ensemble and the Bayesian network: that is, either the raw output's distance from 0.5, or a standard deviation (the stdev reported by the Bayesian model, or the stdev across all the ensemble members' outputs). Every analysis should be run with both.

The code in the senior_research_thesis and the previous alnn_rewrite/analysis folders should be consulted. The models should be loaded from the config.toml that is currently in place.

## Implementation Status

The modular implementation now lives entirely in this folder.

- All new source code is under `alnn_rewrite/analysis`
- All generated artifacts are written under `alnn_rewrite/analysis_output`

If a backend does not already have `model_evaluation_results.nc`, the pipeline now automatically evaluates that backend on the holdout datasets (validation + test) first, saves the generated netCDF into that backend's model output directory, and then runs the analyses.
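
The evaluate-if-missing behaviour described above can be sketched as follows. This is only an illustration of the pattern: `ensure_results`, `fake_evaluate`, and the directory handling are hypothetical and do not reflect the actual function names in `evaluate_models.py`.

```python
import tempfile
from pathlib import Path

RESULTS_NAME = "model_evaluation_results.nc"


def ensure_results(model_output_dir, evaluate_fn):
    """Return the results path, running evaluate_fn(path) only if the file is missing."""
    path = Path(model_output_dir) / RESULTS_NAME
    if not path.exists():
        # Hypothetical callback: evaluates on the holdout sets and writes the netCDF.
        evaluate_fn(path)
    return path


# Demonstration with a stand-in evaluator that just creates an empty file.
calls = []


def fake_evaluate(path):
    calls.append(path)
    path.write_bytes(b"")


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_results(tmp, fake_evaluate)   # file missing: evaluator runs
    second = ensure_results(tmp, fake_evaluate)  # file present: evaluator is skipped
```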

Uncertainty analyses are now run using the following metrics:

- Confidence-based certainty: `2 * |p - 0.5|` where `0` means very uncertain and `1` means very certain
- Confidence-based uncertainty: `1 - 2 * |p - 0.5|` so larger values mean more uncertainty, matching std-based plots
- Secondary uncertainty metric: the ensemble uses std across models; the Bayesian backend uses predictive entropy across MC samples (`bayesian_torch.utils.util.predictive_entropy`)
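
The metrics listed above can be sketched in NumPy as below. All function names here are hypothetical. In particular, `binary_predictive_entropy` is a stand-in written for a single positive-class probability per sample (entropy of the mean prediction across MC samples); it is not the `bayesian_torch` call itself, whose input layout may differ.

```python
import numpy as np


def confidence_certainty(p):
    """2 * |p - 0.5|: 0 means very uncertain, 1 means very certain."""
    return 2.0 * np.abs(np.asarray(p, dtype=float) - 0.5)


def confidence_uncertainty(p):
    """1 - 2 * |p - 0.5|: larger values mean more uncertainty."""
    return 1.0 - confidence_certainty(p)


def ensemble_std(member_probs):
    """Std across ensemble members; assumes shape (n_models, n_samples)."""
    return np.asarray(member_probs, dtype=float).std(axis=0)


def binary_predictive_entropy(mc_probs, eps=1e-12):
    """Entropy of the mean prediction; assumes shape (n_mc_samples, n_samples)."""
    p = np.asarray(mc_probs, dtype=float).mean(axis=0)
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
```

Note that the confidence-based metrics depend only on the mean prediction, whereas the std and entropy metrics also depend on how much the individual members or MC samples disagree.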

### Current Modules

- `evaluate_models.py`: Orchestrator CLI for running selected analyses across ensemble and bayesian backends.
- `runtime.py`: Runtime paths and JSON helpers.
- `data_access.py`: netCDF loading, class probability extraction, and clinical table access.
- `metrics.py`: Shared performance and calibration metrics.
- `analysis_modules.py`: Performance, calibration, physician confidence, and longitudinal analyses.
- `noise_analysis.py`: Evaluation-time Gaussian noise sensitivity analysis.

### How To Run

From `alnn_rewrite`:

```bash
python -m analysis.evaluate_models
```

Useful options:

```bash
python -m analysis.evaluate_models \
	--backend ensemble bayesian \
	--run-name first_modular_run \
	--positive-class-index 0 \
	--calibration-bins 10 \
	--noise-sigmas 0.0 0.01 0.03 0.05 0.1
```

If you want to skip noise analysis while validating the pipeline:

```bash
python -m analysis.evaluate_models --skip-noise
```

### Output Layout

Each run creates a dedicated directory:

```text
alnn_rewrite/analysis_output/
	run_YYYYMMDD_HHMMSS/
		run_manifest.json
		ensemble/
			backend_summary.json
			performance_threshold_sweep.csv
			performance_threshold_sweep.png
			calibration_bins.csv
			calibration_reliability.png
			physician_grouped_metrics.csv
			physician_confidence_grouped_metrics.csv
			physician_std_grouped_metrics.csv
			physician_confidence_boxplot.png
			physician_std_boxplot.png
			longitudinal_patient_summary.csv
			longitudinal_cohort_summary.csv
			longitudinal_uncertainty_by_cohort.csv
			longitudinal_confidence_patient_summary.csv
			longitudinal_std_patient_summary.csv
			longitudinal_confidence_cohort_summary.csv
			longitudinal_std_cohort_summary.csv
			longitudinal_cohort_confidence.png
			longitudinal_cohort_std.png
			noise_sensitivity.csv
			noise_sensitivity.png
			noise_uncertainty.png
			noise_confidence_certainty.png
			noise_examples/
				ensemble_noise_examples.png
				bayesian_noise_examples.png
		bayesian/
			... (same structure)
```

The default noise schedule now includes noisier settings beyond the earlier small-sigma cases, so the saved example images show a clearer progression from lightly noised to heavily noised volumes.

For the noise analysis, the uncertainty plot uses the confidence metric in its uncertainty orientation, so higher values always mean more uncertainty. A separate certainty plot is also saved for direct inspection of `2 * |p - 0.5|`.

Noise is now scaled by each sample's MRI intensity standard deviation, so sigma is dimensionless and interpretable across raw intensity ranges. In this setup, `sigma=1.0` means the injected noise standard deviation is approximately equal to the image's own standard deviation.
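
A minimal sketch of this per-sample scaling, assuming the volume is a NumPy array and the noise is zero-mean Gaussian. The function name `add_scaled_gaussian_noise` is hypothetical and is not taken from `noise_analysis.py`.

```python
import numpy as np


def add_scaled_gaussian_noise(volume, sigma, rng=None):
    """Add zero-mean Gaussian noise whose std is sigma times the volume's own std."""
    rng = np.random.default_rng() if rng is None else rng
    volume = np.asarray(volume, dtype=float)
    # Dimensionless sigma: the absolute noise level follows the sample's intensity spread.
    noise_std = sigma * volume.std()
    return volume + rng.standard_normal(volume.shape) * noise_std
```

With this scaling, the same `sigma` produces comparable relative corruption whether a volume is stored in raw scanner units or has been normalized, and `sigma=0.0` returns the volume unchanged.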