# Analysis Overview

This document summarizes what each analysis in this folder does and what it produces.

## Runtime Defaults

- Analysis defaults are centralized in `analysis/defaults.py`.
- The orchestrator CLI in `analysis/evaluate_models.py` is intentionally minimal and focuses on operational controls (`--backend`, `--run-name`, `--skip-noise`).
- Standalone operational modes include `--longitudinal-breakdown-only`, `--noise-correlation-only`, and `--dataset-summary-only`.
- Threshold grids, uncertainty percentile grids, noise-factor grids, the calibration bin count, the class index, the decision threshold, and the number of Bayesian Monte Carlo (MC) passes are all sourced from `analysis/defaults.py`.

## Shared Utilities

- Shared loader/split logic is centralized in `analysis/data_pipeline.py`.
- All plotting code is centralized in `analysis/plotting.py` for easier inspection and maintenance.

## 1. Performance Threshold Sweep

- Purpose: Measure classification performance as decision threshold changes.
- Inputs: Ground-truth labels and predicted probabilities.
- Method: Compute classification metrics, including accuracy and F1, at each threshold of an evenly spaced grid.
- Main outputs:
  - `performance_threshold_sweep.csv`
  - `plots/performance_threshold_accuracy.png`
  - `plots/performance_threshold_f1.png`
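
A minimal sketch of the sweep follows. It assumes binary 0/1 labels, positive-class probabilities, and a `>=` decision rule; the exact comparison the pipeline uses is not stated here. The helper names `evenly_spaced`, `accuracy_f1`, and `threshold_sweep` are hypothetical, and plain Python is used so the example has no dependencies.

```python
from typing import Dict, List, Sequence


def evenly_spaced(start: float, stop: float, num: int) -> List[float]:
    """Endpoint-inclusive evenly spaced grid, similar to numpy.linspace."""
    if num < 2:
        return [start]
    step = (stop - start) / (num - 1)
    # Pin the last point to `stop` so float rounding cannot drift past it.
    return [start + i * step for i in range(num - 1)] + [stop]


def accuracy_f1(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Accuracy and binary F1 (positive class = 1); F1 is 0.0 when undefined."""
    pairs = list(zip(labels, preds))
    tp = sum(y == 1 and p == 1 for y, p in pairs)
    fp = sum(y == 0 and p == 1 for y, p in pairs)
    fn = sum(y == 1 and p == 0 for y, p in pairs)
    denom = 2 * tp + fp + fn
    return {"accuracy": sum(y == p for y, p in pairs) / len(pairs),
            "f1": 2 * tp / denom if denom else 0.0}


def threshold_sweep(labels: Sequence[int], probs: Sequence[float],
                    num_thresholds: int = 11) -> List[Dict[str, float]]:
    """One row per threshold; a sample is predicted positive when prob >= threshold."""
    rows = []
    for t in evenly_spaced(0.0, 1.0, num_thresholds):
        preds = [1 if p >= t else 0 for p in probs]
        rows.append({"threshold": t, **accuracy_f1(labels, preds)})
    return rows


rows = threshold_sweep([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(rows[5])  # the row for threshold 0.5
```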

## 2. Uncertainty Cutoff (Raw-Value)

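- Purpose: Evaluate performance on subsets whose uncertainty is at or below percentile-derived cutoff values.
- Inputs: Ground-truth labels, predicted probabilities, and uncertainty arrays (confidence-derived uncertainty and the backend-specific measure: ensemble standard deviation or Bayesian predictive uncertainty).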
- Method:
  - Build evenly spaced percentile points.
  - Convert each percentile to a raw uncertainty cutoff value.
  - Keep samples where uncertainty is less than or equal to that cutoff.
  - Compute accuracy and F1 for each retained subset.
- Main outputs:
  - `performance_uncertainty_cutoff.csv`
  - `plots/performance_uncertainty_cutoff_accuracy.png`
  - `plots/performance_uncertainty_cutoff_f1.png`
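
The four method steps can be sketched as follows. This assumes a `>=` decision threshold on positive-class probabilities and numpy-style linear interpolation for converting a percentile into a raw cutoff value; the pipeline's own interpolation method is not specified here. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation, like numpy's default method."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def accuracy_f1(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Accuracy and binary F1; F1 is 0.0 when it is undefined."""
    pairs = list(zip(labels, preds))
    tp = sum(y == 1 and p == 1 for y, p in pairs)
    fp = sum(y == 0 and p == 1 for y, p in pairs)
    fn = sum(y == 1 and p == 0 for y, p in pairs)
    denom = 2 * tp + fp + fn
    return {"accuracy": sum(y == p for y, p in pairs) / len(pairs),
            "f1": 2 * tp / denom if denom else 0.0}


def uncertainty_cutoff_sweep(labels: Sequence[int], probs: Sequence[float],
                             uncertainty: Sequence[float], percentiles: Sequence[float],
                             threshold: float = 0.5) -> List[Dict[str, float]]:
    """For each percentile: raw cutoff value, retained count, accuracy, and F1."""
    rows = []
    for q in percentiles:
        cutoff = percentile(uncertainty, q)  # percentile -> raw uncertainty value
        # Never empty: the cutoff is always >= the smallest uncertainty value.
        keep = [i for i, u in enumerate(uncertainty) if u <= cutoff]
        preds = [1 if probs[i] >= threshold else 0 for i in keep]
        rows.append({"percentile": q, "cutoff": cutoff, "n_retained": len(keep),
                     **accuracy_f1([labels[i] for i in keep], preds)})
    return rows


rows = uncertainty_cutoff_sweep(
    labels=[1, 0, 1, 0, 1],
    probs=[0.9, 0.2, 0.4, 0.6, 0.8],
    uncertainty=[0.1, 0.2, 0.3, 0.4, 0.5],
    percentiles=[0.0, 50.0, 100.0],
)
```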

## 3. Uncertainty Cutoff (Percentile-Ranked)

- Purpose: Track how performance changes as the retained subset narrows from all samples to only the lowest-uncertainty samples, indexed by percentile rather than by raw cutoff value.
- Inputs: Same uncertainty arrays as above.
- Method:
  - Use an evenly spaced percentile grid.
  - Keep samples where uncertainty is less than or equal to the selected percentile cutoff.
  - Plot from least restricted on the left (all samples) to most restricted on the right (lowest-uncertainty subset only).
- Main outputs:
  - `performance_uncertainty_percentile_cutoff.csv`
  - `plots/performance_uncertainty_percentile_cutoff_accuracy.png`
  - `plots/performance_uncertainty_percentile_cutoff_f1.png`
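
A minimal sketch of the left-to-right ordering follows. It assumes a descending retained-percentage grid that stops at a hypothetical `min_percent`. Unlike the `<=` cutoff in the method steps, this variant keeps the `k` lowest-uncertainty samples by rank, so the two can differ when uncertainty values tie. It illustrates percentile ranking and is not the pipeline's code.

```python
import math
from typing import Dict, List, Sequence


def accuracy_f1(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Accuracy and binary F1; F1 is 0.0 when it is undefined."""
    pairs = list(zip(labels, preds))
    tp = sum(y == 1 and p == 1 for y, p in pairs)
    fp = sum(y == 0 and p == 1 for y, p in pairs)
    fn = sum(y == 1 and p == 0 for y, p in pairs)
    denom = 2 * tp + fp + fn
    return {"accuracy": sum(y == p for y, p in pairs) / len(pairs),
            "f1": 2 * tp / denom if denom else 0.0}


def percentile_ranked_sweep(labels: Sequence[int], probs: Sequence[float],
                            uncertainty: Sequence[float], num_points: int = 5,
                            min_percent: float = 20.0,
                            threshold: float = 0.5) -> List[Dict[str, float]]:
    """Rows ordered from all samples retained to only the least uncertain ones."""
    n = len(uncertainty)
    # Indices from lowest to highest uncertainty; sorted() is stable, so ties keep input order.
    order = sorted(range(n), key=lambda i: uncertainty[i])
    step = (100.0 - min_percent) / (num_points - 1)
    rows = []
    for j in range(num_points):
        # Left of the plot: 100% retained. Right of the plot: min_percent retained.
        retained_percent = 100.0 - j * step
        k = max(1, math.ceil(n * retained_percent / 100.0))
        keep = order[:k]
        preds = [1 if probs[i] >= threshold else 0 for i in keep]
        rows.append({"retained_percent": retained_percent, "n_retained": k,
                     **accuracy_f1([labels[i] for i in keep], preds)})
    return rows


rows = percentile_ranked_sweep(
    labels=[0, 1, 1, 0, 1],
    probs=[0.9, 0.8, 0.3, 0.1, 0.7],
    uncertainty=[0.5, 0.1, 0.4, 0.2, 0.3],
)
```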

## 4. Calibration Analysis

- Purpose: Quantify probability calibration quality.
- Inputs: Ground-truth labels and predicted probabilities.
- Method:
  - Reliability binning with configurable bin count.
  - Compute the maximum calibration error (MCE) and the Brier score.
- Main outputs:
  - `calibration_bins.csv`
  - `plots/calibration_reliability.png`
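
A sketch of equal-width reliability binning with MCE and Brier score follows. It assumes binary labels and positive-class probabilities in [0, 1]. The pipeline may bin positive-class probability or top-class confidence; this document does not say which, so treat that choice as an assumption. `calibration_summary` is a hypothetical name.

```python
from typing import Dict, List, Sequence, Tuple


def calibration_summary(labels: Sequence[int], probs: Sequence[float],
                        n_bins: int = 10) -> Tuple[List[Dict[str, float]], float, float]:
    """Equal-width reliability bins over [0, 1], plus MCE and Brier score."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((y, p))

    rows = []
    for b, members in enumerate(bins):
        if not members:  # empty bins are skipped and do not affect MCE
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        frac_pos = sum(y for y, _ in members) / len(members)
        rows.append({
            "bin_lower": b / n_bins,
            "bin_upper": (b + 1) / n_bins,
            "count": len(members),
            "mean_predicted": mean_pred,
            "fraction_positive": frac_pos,
            "gap": abs(frac_pos - mean_pred),
        })

    mce = max(r["gap"] for r in rows)  # worst-case gap over non-empty bins
    brier = sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)
    return rows, mce, brier


rows, mce, brier = calibration_summary([0, 1, 1, 0], [0.15, 0.15, 0.85, 0.95])
```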

## 5. Physician Confidence Comparison

- Purpose: Compare model uncertainty with physician confidence ratings.
- Inputs: Evaluation outputs plus a clinical table containing the `Image Data ID` column and a physician confidence column.
- Method:
  - Merge by image ID.
  - Group metrics by physician confidence level.
  - Plot distributions per rating group.
- Main outputs include grouped summary CSV files and boxplots of model confidence and of either the standard deviation (ensemble backend) or the predictive uncertainty (Bayesian backend).
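
The merge-and-group steps can be sketched in plain Python. The sketch assumes the evaluation rows and the clinical table share the `Image Data ID` key. The rating column name `physician_confidence` and the value column `uncertainty` are hypothetical placeholders. Images without a physician rating are dropped, as an inner join would do.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List, Mapping, Sequence


def group_by_physician_confidence(
    eval_rows: Sequence[Mapping],
    clinical_rows: Sequence[Mapping],
    id_key: str = "Image Data ID",
    rating_key: str = "physician_confidence",  # hypothetical column name
    value_key: str = "uncertainty",            # hypothetical column name
) -> Dict[object, Dict[str, float]]:
    """Join evaluation rows to physician ratings by image ID and summarize per rating."""
    rating_by_id = {row[id_key]: row[rating_key] for row in clinical_rows}
    groups: Dict[object, List[float]] = defaultdict(list)
    for row in eval_rows:
        rating = rating_by_id.get(row[id_key])
        if rating is None:  # inner-join semantics: skip images without a rating
            continue
        groups[rating].append(row[value_key])
    return {
        rating: {"n": len(values), "mean": mean(values), "median": median(values)}
        for rating, values in sorted(groups.items())
    }


eval_rows = [
    {"Image Data ID": "A", "uncertainty": 0.1},
    {"Image Data ID": "B", "uncertainty": 0.3},
    {"Image Data ID": "C", "uncertainty": 0.2},
    {"Image Data ID": "D", "uncertainty": 0.9},  # no physician rating below
]
clinical_rows = [
    {"Image Data ID": "A", "physician_confidence": 3},
    {"Image Data ID": "B", "physician_confidence": 1},
    {"Image Data ID": "C", "physician_confidence": 3},
]
summary = group_by_physician_confidence(eval_rows, clinical_rows)
```

The per-group lists in `groups` are also what a per-rating boxplot would be drawn from.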

## 6. Longitudinal Stability Analysis

- Purpose: Examine uncertainty patterns across stable and transitioning patient trajectories.
- Inputs: Evaluation outputs and clinical timeline information.
- Method:
  - Build patient-level trajectories.
  - Group patients into cohorts by how their clinical status evolves over time (for example, stable versus transitioning).
  - Compare uncertainty summaries across cohorts.
- Main outputs include patient/cohort summary CSV files and cohort uncertainty plots.
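
A minimal sketch of the trajectory and cohort steps follows. It assumes each record carries hypothetical `patient_id`, `visit`, `diagnosis`, and `uncertainty` fields. It also assumes a patient is "stable" when every visit has the same diagnosis and "transitioning" otherwise. The pipeline's actual cohort rules may be finer-grained.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Mapping, Sequence, Tuple


def cohort_uncertainty(records: Sequence[Mapping]) -> Tuple[List[dict], Dict[str, dict]]:
    """Build per-patient trajectories, label each stable/transitioning, summarize by cohort."""
    by_patient: Dict[object, List[Mapping]] = defaultdict(list)
    for record in records:
        by_patient[record["patient_id"]].append(record)

    patient_rows = []
    cohort_values: Dict[str, List[float]] = defaultdict(list)
    for patient_id, visits in by_patient.items():
        visits = sorted(visits, key=lambda r: r["visit"])  # chronological trajectory
        diagnoses = [v["diagnosis"] for v in visits]
        # Assumed rule: any change in diagnosis across visits marks a transition.
        cohort = "stable" if len(set(diagnoses)) == 1 else "transitioning"
        patient_mean = mean(v["uncertainty"] for v in visits)
        patient_rows.append({"patient_id": patient_id, "cohort": cohort,
                             "trajectory": diagnoses, "mean_uncertainty": patient_mean})
        cohort_values[cohort].append(patient_mean)

    cohort_summary = {c: {"n_patients": len(vals), "mean_uncertainty": mean(vals)}
                      for c, vals in cohort_values.items()}
    return patient_rows, cohort_summary


records = [
    {"patient_id": "P1", "visit": 1, "diagnosis": 0, "uncertainty": 0.1},
    {"patient_id": "P1", "visit": 2, "diagnosis": 0, "uncertainty": 0.3},
    {"patient_id": "P2", "visit": 2, "diagnosis": 1, "uncertainty": 0.7},
    {"patient_id": "P2", "visit": 1, "diagnosis": 0, "uncertainty": 0.5},
]
patients, cohorts = cohort_uncertainty(records)
```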

## 7. Noise Sensitivity Analysis

- Purpose: Test robustness and uncertainty behavior under synthetic Gaussian noise.
- Inputs: Holdout data loader, model backend, noise factor schedule, threshold, calibration bins.
- Method:
  - Use an evenly spaced noise factor schedule.
  - Add Gaussian noise scaled by a fixed intensity-range factor.
  - Recompute performance and calibration at each sigma.
  - Save visual examples of noisy images alongside a clean reference scan.
- Main outputs:
  - `noise_sensitivity.csv`
  - `plots/noise_sensitivity_accuracy.png`
  - `plots/noise_sensitivity_f1.png`
  - `plots/noise_confidence.png`
  - `plots/noise_standard_deviation.png` (ensemble) / `plots/noise_predictive_uncertainty.png` (bayesian)
  - `plots/noise_examples/*_noise_examples.png`
  - `plots/noise_examples/*_clean_scan_example.png`
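
A sketch of the noise schedule and injection follows. It assumes sigma equals the noise factor multiplied by a fixed intensity range (for instance 1.0 for intensities normalized to [0, 1]). Scans are flattened to lists of floats, and `predict` stands in for the model backend. All names are hypothetical. Only accuracy is recomputed here for brevity, and a real implementation would more likely operate on NumPy arrays or tensors.

```python
import random
from typing import Callable, Dict, List, Sequence


def noise_schedule(max_factor: float, num_steps: int) -> List[float]:
    """Evenly spaced noise factors from 0.0 (clean) up to max_factor, inclusive."""
    step = max_factor / (num_steps - 1)
    return [i * step for i in range(num_steps - 1)] + [max_factor]


def add_gaussian_noise(voxels: Sequence[float], noise_factor: float,
                       intensity_range: float, rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with sigma = noise_factor * intensity_range."""
    sigma = noise_factor * intensity_range
    return [v + rng.gauss(0.0, sigma) for v in voxels]


def noise_sweep(images: Sequence[Sequence[float]], labels: Sequence[int],
                predict: Callable[[List[float]], float], factors: Sequence[float],
                intensity_range: float = 1.0, threshold: float = 0.5,
                seed: int = 0) -> List[Dict[str, float]]:
    """Re-score the same images at each noise factor and record accuracy."""
    rng = random.Random(seed)  # seeded so repeated runs draw identical noise
    rows = []
    for factor in factors:
        noisy = [add_gaussian_noise(img, factor, intensity_range, rng) for img in images]
        preds = [1 if predict(img) >= threshold else 0 for img in noisy]
        accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        rows.append({"noise_factor": factor, "sigma": factor * intensity_range,
                     "accuracy": accuracy})
    return rows


# Toy usage: the "model" is just the mean intensity of a flattened scan.
images = [[0.0] * 50, [1.0] * 50]
rows = noise_sweep(images, [0, 1], lambda img: sum(img) / len(img), noise_schedule(0.2, 5))
```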

## 8. Dataset Composition Summary

- Purpose: Report dataset composition without rerunning full evaluation analyses.
- Inputs: Raw dataset files and configured train/validation/test split ratios.
- Method:
  - Rebuild the dataset and split assignment using the configured split seed.
  - Count total images and positive/negative labels overall.
  - Count train/validation/test images and per-split class balance.
  - Compute both overall and split-level percentages.
- Main outputs:
  - `dataset_summary.md`
- CLI:
  - `python -m analysis.evaluate_models --dataset-summary-only`
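
The split-and-count steps can be sketched as follows. The sketch assumes a seeded shuffle followed by ratio-based slicing, with any rounding remainder going to the test split. The ratios and seed in the usage lines are placeholders, not the configured values, and `assign_splits`, `class_counts`, and `composition` are hypothetical names.

```python
import random
from typing import Dict, List, Sequence, Tuple


def assign_splits(n: int, ratios: Tuple[float, float, float],
                  seed: int) -> Dict[str, List[int]]:
    """Shuffle sample indices with a fixed seed, then slice by train/validation ratios."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # same seed -> same split assignment
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],  # ratios[2] is implied: test takes the remainder
    }


def class_counts(labels: Sequence[int], indices: Sequence[int]) -> Dict[str, float]:
    """Positive/negative counts and percentages for one subset (labels are 0/1)."""
    total = len(indices)
    positive = sum(labels[i] for i in indices)
    negative = total - positive

    def pct(k: int) -> float:
        return 100.0 * k / total if total else 0.0

    return {"total": total, "positive": positive, "negative": negative,
            "positive_pct": pct(positive), "negative_pct": pct(negative)}


def composition(labels: Sequence[int],
                splits: Dict[str, List[int]]) -> Dict[str, Dict[str, float]]:
    """Overall class balance plus per-split counts and each split's share of the dataset."""
    summary = {"overall": class_counts(labels, range(len(labels)))}
    for name, indices in splits.items():
        row = class_counts(labels, indices)
        row["share_of_dataset_pct"] = 100.0 * len(indices) / len(labels)
        summary[name] = row
    return summary


labels = [1, 0, 1, 0, 1, 0, 0, 0]
splits = assign_splits(len(labels), ratios=(0.5, 0.25, 0.25), seed=7)  # placeholder values
summary = composition(labels, splits)
```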

## Plot Report

- Each backend's output includes `plots_report.md`.
- The report embeds all generated plot images and includes a short description per plot.
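
A sketch of how such a report could be assembled follows. It assumes the caller supplies a list of `(relative_path, description)` pairs. The function name, the heading layout, and the way titles are derived from file names are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, Tuple


def build_plots_report(plots: Iterable[Tuple[str, str]], title: str = "Plot Report") -> str:
    """Return Markdown with one section per plot: heading, description, embedded image."""
    lines = [f"# {title}", ""]
    for rel_path, description in plots:
        # Derive a readable heading from the file name, e.g. "calibration reliability".
        name = PurePosixPath(rel_path).stem.replace("_", " ")
        lines += [f"## {name}", "", description, "", f"![{name}]({rel_path})", ""]
    return "\n".join(lines)


report = build_plots_report([
    ("plots/calibration_reliability.png",
     "Reliability diagram of predicted versus observed positive rates."),
])
# In practice the string would be written next to the plots, e.g.
# Path(output_dir, "plots_report.md").write_text(report)
```

Using paths relative to the report file keeps the embedded images working if the whole output directory is moved.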

## Notes on Even Spacing

The pipeline uses evenly spaced grids for sampled x-axis points in threshold and uncertainty-cutoff performance plots, and for the sigma schedule used by noise sensitivity plots.