# Analysis Overview

This document summarizes what each analysis in this folder does and what it produces.

## Runtime Defaults

- Analysis defaults are centralized in `analysis/defaults.py`.
- The orchestrator CLI in `analysis/evaluate_models.py` is intentionally minimal and focuses on operational controls (`--backend`, `--run-name`, `--skip-noise`).
- Standalone operational modes include `--longitudinal-breakdown-only`, `--noise-correlation-only`, and `--dataset-summary-only`.
- Threshold grids, uncertainty percentile grids, noise-factor grids, calibration bins, class index, decision threshold, and bayesian MC passes are sourced from `analysis/defaults.py`.

## Shared Utilities

- Shared loader/split logic is centralized in `analysis/data_pipeline.py`.
- All plotting code is centralized in `analysis/plotting.py` for easier inspection and maintenance.

## 1. Performance Threshold Sweep

- Purpose: Measure classification performance as decision threshold changes.
- Inputs: Ground-truth labels and predicted probabilities.
- Method: Evaluate metrics across an evenly spaced threshold grid.
- Main outputs:
  - `performance_threshold_sweep.csv`
  - `plots/performance_threshold_accuracy.png`
  - `plots/performance_threshold_f1.png`

## 2. Uncertainty Cutoff (Raw-Value)

- Purpose: Evaluate performance on subsets with uncertainty below percentile-derived cutoff values.
- Inputs: Uncertainty arrays (confidence-derived uncertainty and backend-specific uncertainty).
- Method:
  - Build evenly spaced percentile points.
  - Convert each percentile to a raw uncertainty cutoff value.
  - Keep samples where uncertainty is less than or equal to that cutoff.
  - Compute accuracy and F1 for each retained subset.
- Main outputs:
  - `performance_uncertainty_cutoff.csv`
  - `plots/performance_uncertainty_cutoff_accuracy.png`
  - `plots/performance_uncertainty_cutoff_f1.png`

## 3. Uncertainty Cutoff (Percentile-Ranked)

- Purpose: Evaluate performance from all samples toward only the lowest-uncertainty samples.
- Inputs: Same uncertainty arrays as above.
- Method:
  - Use an evenly spaced percentile grid.
  - Keep samples where uncertainty is less than or equal to the selected percentile cutoff.
  - Plot from least restricted on the left (all samples) to most restricted on the right (lowest-uncertainty subset only).
- Main outputs:
  - `performance_uncertainty_percentile_cutoff.csv`
  - `plots/performance_uncertainty_percentile_cutoff_accuracy.png`
  - `plots/performance_uncertainty_percentile_cutoff_f1.png`

## 4. Calibration Analysis

- Purpose: Quantify probability calibration quality.
- Inputs: Ground-truth labels and predicted probabilities.
- Method:
  - Reliability binning with configurable bin count.
  - Compute MCE and Brier score.
- Main outputs:
  - `calibration_bins.csv`
  - `plots/calibration_reliability.png`

## 5. Physician Confidence Comparison

- Purpose: Compare model uncertainty with physician confidence ratings.
- Inputs: Evaluation outputs plus clinical table (Image Data ID + physician confidence column).
- Method:
  - Merge by image ID.
  - Group metrics by physician confidence level.
  - Plot distributions per rating group.
- Main outputs include grouped summary CSV files and boxplots for confidence and standard deviation (ensemble) or predictive uncertainty (bayesian).

## 6. Longitudinal Stability Analysis

- Purpose: Examine uncertainty patterns across stable and transitioning patient trajectories.
- Inputs: Evaluation outputs and clinical timeline information.
- Method:
  - Build patient-level trajectories.
  - Group by clinical cohort dynamics.
  - Compare uncertainty summaries across cohorts.
- Main outputs include patient/cohort summary CSV files and cohort uncertainty plots.

## 7. Noise Sensitivity Analysis

- Purpose: Test robustness and uncertainty behavior under synthetic Gaussian noise.
- Inputs: Holdout data loader, model backend, noise factor schedule, threshold, calibration bins.
- Method:
  - Use an evenly spaced noise factor schedule.
  - Add Gaussian noise scaled by a fixed intensity-range factor.
  - Recompute performance and calibration at each sigma.
  - Save visual examples of noised images.
- Main outputs:
  - `noise_sensitivity.csv`
  - `plots/noise_sensitivity_accuracy.png`
  - `plots/noise_sensitivity_f1.png`
  - `plots/noise_confidence.png`
  - `plots/noise_standard_deviation.png` (ensemble) / `plots/noise_predictive_uncertainty.png` (bayesian)
  - `plots/noise_examples/*_noise_examples.png`
  - `plots/noise_examples/*_clean_scan_example.png`

## 8. Dataset Composition Summary

- Purpose: Report dataset composition without rerunning full evaluation analyses.
- Inputs: Raw dataset files and configured train/validation/test split ratios.
- Method:
  - Rebuild dataset and split assignment using the configured split seed.
  - Count total images and positive/negative labels overall.
  - Count train/validation/test images and per-split class balance.
  - Compute both overall and split-level percentages.
- Main outputs:
  - `dataset_summary.md`
- CLI:
  - `python -m analysis.evaluate_models --dataset-summary-only`

## Plot Report

- Each backend output now includes `plots_report.md`.
- The report embeds all generated plot images and includes a short description per plot.

## Notes on Even Spacing

The pipeline uses evenly spaced grids for sampled x-axis points in threshold and uncertainty-cutoff performance plots, and for the sigma schedule used by noise sensitivity plots.
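
As an illustration of the evenly spaced grids described in this document, the following is a minimal sketch (not the repository's implementation) of a threshold sweep and a percentile-based uncertainty cutoff sweep. The function names, grid sizes, default threshold, and toy arrays are all hypothetical; the actual values are sourced from `analysis/defaults.py`.

```python
import numpy as np


def f1_score_binary(y_true, y_pred):
    """F1 for the positive class; 0.0 when there are no true positives."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def threshold_sweep(y_true, probs, num_points=5):
    """Accuracy and F1 at evenly spaced decision thresholds in [0, 1]."""
    rows = []
    for t in np.linspace(0.0, 1.0, num_points):
        y_pred = (probs >= t).astype(int)
        rows.append({
            "threshold": float(t),
            "accuracy": float(np.mean(y_pred == y_true)),
            "f1": f1_score_binary(y_true, y_pred),
        })
    return rows


def uncertainty_cutoff_sweep(y_true, probs, uncertainty,
                             num_points=5, threshold=0.5):
    """Accuracy and F1 on subsets with uncertainty <= a percentile cutoff.

    Percentiles run from 100 (all samples) down to 0 (lowest-uncertainty
    samples only), on an evenly spaced grid.
    """
    y_pred = (probs >= threshold).astype(int)
    rows = []
    for p in np.linspace(100.0, 0.0, num_points):
        # Convert the percentile to a raw uncertainty cutoff value.
        cutoff = float(np.percentile(uncertainty, p))
        keep = uncertainty <= cutoff  # never empty: the minimum always passes
        rows.append({
            "percentile": float(p),
            "cutoff": cutoff,
            "n_retained": int(np.sum(keep)),
            "accuracy": float(np.mean(y_pred[keep] == y_true[keep])),
            "f1": f1_score_binary(y_true[keep], y_pred[keep]),
        })
    return rows


# Toy data, for illustration only.
y_true = np.array([1, 0, 1, 0, 1, 0])
probs = np.array([0.9, 0.1, 0.8, 0.6, 0.4, 0.2])
uncertainty = np.array([0.1, 0.1, 0.2, 0.9, 0.8, 0.2])

for row in threshold_sweep(y_true, probs):
    print(row)
for row in uncertainty_cutoff_sweep(y_true, probs, uncertainty, num_points=3):
    print(row)
```

In this sketch `np.linspace` supplies the even spacing for both sweeps, so the x-axis points in the resulting plots would be equidistant regardless of how the underlying probabilities or uncertainties are distributed.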