Analysis Overview
This document summarizes what each analysis in this folder does and what it produces.
Runtime Defaults
- Analysis defaults are centralized in analysis/defaults.py.
- The orchestrator CLI in analysis/evaluate_models.py is intentionally minimal and focuses on operational controls (--backend, --run-name, --skip-noise).
- Standalone operational modes include --longitudinal-breakdown-only, --noise-correlation-only, and --dataset-summary-only.
- Threshold grids, uncertainty percentile grids, noise-factor grids, calibration bins, class index, decision threshold, and bayesian MC passes are sourced from analysis/defaults.py.
Shared Utilities
- Shared loader/split logic is centralized in analysis/data_pipeline.py.
- All plotting code is centralized in analysis/plotting.py for easier inspection and maintenance.
1. Performance Threshold Sweep
- Purpose: Measure classification performance as decision threshold changes.
- Inputs: Ground-truth labels and predicted probabilities.
- Method: Evaluate metrics across an evenly spaced threshold grid.
- Main outputs:
- performance_threshold_sweep.csv
- plots/performance_threshold_accuracy.png
- plots/performance_threshold_f1.png
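The sweep described above can be sketched roughly as follows. This is a minimal illustration, not the code in analysis/evaluate_models.py: the function name `threshold_sweep`, the `n_points` parameter, and the row keys are assumptions made for the example, and the real grid size comes from analysis/defaults.py.

```python
import numpy as np

def threshold_sweep(y_true, y_prob, n_points=11):
    """Return one row of metrics per threshold on an evenly spaced grid (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob, dtype=float)
    rows = []
    for t in np.linspace(0.0, 1.0, n_points):
        # A sample is predicted positive when its probability reaches the threshold.
        y_pred = (y_prob >= t).astype(int)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        accuracy = float(np.mean(y_pred == y_true))
        denom = 2 * tp + fp + fn
        # F1 is reported as 0.0 when there are no positives at all.
        f1 = 2 * tp / denom if denom else 0.0
        rows.append({"threshold": float(t), "accuracy": accuracy, "f1": f1})
    return rows

rows = threshold_sweep([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], n_points=5)
```

Each row corresponds to one line of a CSV such as performance_threshold_sweep.csv, although the actual column names may differ.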
2. Uncertainty Cutoff (Raw-Value)
- Purpose: Evaluate performance on subsets whose uncertainty is at or below percentile-derived cutoff values.
- Inputs: Uncertainty arrays (confidence-derived uncertainty and backend-specific uncertainty).
- Method:
- Build evenly spaced percentile points.
- Convert each percentile to a raw uncertainty cutoff value.
- Keep samples where uncertainty is less than or equal to that cutoff.
- Compute accuracy and F1 for each retained subset.
- Main outputs:
- performance_uncertainty_cutoff.csv
- plots/performance_uncertainty_cutoff_accuracy.png
- plots/performance_uncertainty_cutoff_f1.png
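A minimal sketch of the cutoff procedure, assuming hard predictions are already available. The function name `uncertainty_cutoff_sweep`, its parameters, and the row keys are hypothetical and chosen only for illustration.

```python
import numpy as np

def uncertainty_cutoff_sweep(y_true, y_pred, uncertainty, n_points=5):
    """Accuracy and F1 on the subset retained at each percentile-derived cutoff."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    uncertainty = np.asarray(uncertainty, dtype=float)
    rows = []
    for pct in np.linspace(0.0, 100.0, n_points):
        # Convert the percentile into a raw uncertainty value.
        cutoff = float(np.percentile(uncertainty, pct))
        # Retain samples at or below the cutoff; the minimum is always kept.
        keep = uncertainty <= cutoff
        yt, yp = y_true[keep], y_pred[keep]
        tp = int(np.sum((yp == 1) & (yt == 1)))
        fp = int(np.sum((yp == 1) & (yt == 0)))
        fn = int(np.sum((yp == 0) & (yt == 1)))
        denom = 2 * tp + fp + fn
        rows.append({
            "percentile": float(pct),
            "cutoff": cutoff,
            "n_retained": int(keep.sum()),
            "accuracy": float(np.mean(yp == yt)),
            "f1": 2 * tp / denom if denom else 0.0,
        })
    return rows

rows = uncertainty_cutoff_sweep(
    y_true=[1, 0, 1, 0, 1],
    y_pred=[1, 0, 0, 0, 1],
    uncertainty=[0.05, 0.10, 0.40, 0.20, 0.15],
    n_points=3,
)
```

In this toy input the single misclassified sample also has the highest uncertainty, so accuracy only drops once the cutoff reaches the 100th percentile.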
3. Uncertainty Cutoff (Percentile-Ranked)
- Purpose: Evaluate how performance changes as the retained set shrinks from all samples to only the lowest-uncertainty samples.
- Inputs: Same uncertainty arrays as above.
- Method:
- Use an evenly spaced percentile grid.
- Keep samples where uncertainty is less than or equal to the selected percentile cutoff.
- Plot from least restricted on the left (all samples) to most restricted on the right (lowest-uncertainty subset only).
- Main outputs:
- performance_uncertainty_percentile_cutoff.csv
- plots/performance_uncertainty_percentile_cutoff_accuracy.png
- plots/performance_uncertainty_percentile_cutoff_f1.png
4. Calibration Analysis
- Purpose: Quantify probability calibration quality.
- Inputs: Ground-truth labels and predicted probabilities.
- Method:
- Reliability binning with configurable bin count.
- Compute MCE and Brier score.
- Main outputs:
- calibration_bins.csv
- plots/calibration_reliability.png
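The binning and the two summary metrics might look like the sketch below. It assumes that MCE here means maximum calibration error (the largest gap between mean predicted probability and observed positive fraction over non-empty bins) and that bins are equal-width on [0, 1]. The function name and row keys are hypothetical.

```python
import numpy as np

def calibration_summary(y_true, y_prob, n_bins=10):
    """Reliability bins, maximum calibration error, and Brier score (illustrative only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges maps p == 1.0 into the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    bins = []
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins are skipped rather than reported as zero
        mean_predicted = float(y_prob[mask].mean())
        fraction_positive = float(y_true[mask].mean())
        bins.append({
            "bin": b,
            "count": int(mask.sum()),
            "mean_predicted": mean_predicted,
            "fraction_positive": fraction_positive,
            "gap": abs(fraction_positive - mean_predicted),
        })
    mce = max(row["gap"] for row in bins)
    brier = float(np.mean((y_prob - y_true) ** 2))
    return bins, mce, brier

bins, mce, brier = calibration_summary([0, 0, 1, 1], [0.1, 0.3, 0.7, 0.9], n_bins=2)
```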
5. Physician Confidence Comparison
- Purpose: Compare model uncertainty with physician confidence ratings.
- Inputs: Evaluation outputs plus a clinical table containing the Image Data ID and a physician confidence column.
- Method:
- Merge by image ID.
- Group metrics by physician confidence level.
- Plot distributions per rating group.
- Main outputs include grouped summary CSV files and boxplots for confidence and standard deviation (ensemble) or predictive uncertainty (bayesian).
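As a rough illustration of the merge-and-group step, the sketch below joins evaluation rows to clinical rows on Image Data ID and averages per rating. Only the Image Data ID column name comes from this document. The `physician_confidence`, `confidence`, and `uncertainty` keys and the function name are assumptions, and the real pipeline may well use a dataframe library instead of plain dictionaries.

```python
from collections import defaultdict
from statistics import mean

def group_by_physician_confidence(eval_rows, clinical_rows,
                                  id_key="Image Data ID",
                                  rating_key="physician_confidence"):
    """Join on image ID, then summarize model outputs per physician rating."""
    ratings = {row[id_key]: row[rating_key] for row in clinical_rows}
    grouped = defaultdict(list)
    for row in eval_rows:
        rating = ratings.get(row[id_key])
        if rating is None:
            continue  # images without a clinical rating are dropped (inner join)
        grouped[rating].append(row)
    summary = {}
    for rating in sorted(grouped):
        rows = grouped[rating]
        summary[rating] = {
            "n": len(rows),
            "mean_confidence": mean(r["confidence"] for r in rows),
            "mean_uncertainty": mean(r["uncertainty"] for r in rows),
        }
    return summary

eval_rows = [
    {"Image Data ID": "I1", "confidence": 0.9, "uncertainty": 0.02},
    {"Image Data ID": "I2", "confidence": 0.6, "uncertainty": 0.10},
    {"Image Data ID": "I3", "confidence": 0.8, "uncertainty": 0.04},
    {"Image Data ID": "I4", "confidence": 0.7, "uncertainty": 0.06},
]
clinical_rows = [
    {"Image Data ID": "I1", "physician_confidence": 3},
    {"Image Data ID": "I2", "physician_confidence": 1},
    {"Image Data ID": "I3", "physician_confidence": 3},
]
summary = group_by_physician_confidence(eval_rows, clinical_rows)
```

The per-rating lists in `grouped` are also what a boxplot per rating group would be drawn from.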
6. Longitudinal Stability Analysis
- Purpose: Examine uncertainty patterns across stable and transitioning patient trajectories.
- Inputs: Evaluation outputs and clinical timeline information.
- Method:
- Build patient-level trajectories.
- Group by clinical cohort dynamics.
- Compare uncertainty summaries across cohorts.
- Main outputs include patient/cohort summary CSV files and cohort uncertainty plots.
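The cohort comparison could be sketched as below. The cohort rule is an assumption made for the example: a patient counts as "stable" if every visit carries the same diagnosis and as "transitioning" otherwise. The actual cohort definitions live in the analysis code and may differ. All field names (`patient_id`, `visit`, `diagnosis`, `uncertainty`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def cohort_uncertainty(records):
    """Build per-patient trajectories, assign a cohort, and summarize uncertainty."""
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec["patient_id"]].append(rec)

    patients = []
    for patient_id, visits in by_patient.items():
        # Order visits in time so the trajectory reads first visit to last visit.
        visits = sorted(visits, key=lambda r: r["visit"])
        trajectory = [v["diagnosis"] for v in visits]
        cohort = "stable" if len(set(trajectory)) == 1 else "transitioning"
        patients.append({
            "patient_id": patient_id,
            "trajectory": trajectory,
            "cohort": cohort,
            "mean_uncertainty": mean(v["uncertainty"] for v in visits),
        })

    by_cohort = defaultdict(list)
    for p in patients:
        by_cohort[p["cohort"]].append(p["mean_uncertainty"])
    cohorts = {
        name: {"n_patients": len(vals), "mean_uncertainty": mean(vals)}
        for name, vals in by_cohort.items()
    }
    return patients, cohorts

records = [
    {"patient_id": "P1", "visit": 12, "diagnosis": "A", "uncertainty": 0.2},
    {"patient_id": "P1", "visit": 0, "diagnosis": "A", "uncertainty": 0.1},
    {"patient_id": "P2", "visit": 0, "diagnosis": "A", "uncertainty": 0.3},
    {"patient_id": "P2", "visit": 12, "diagnosis": "B", "uncertainty": 0.5},
]
patients, cohorts = cohort_uncertainty(records)
```

The `patients` list corresponds to a patient-level summary and `cohorts` to a cohort-level one.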
7. Noise Sensitivity Analysis
- Purpose: Test robustness and uncertainty behavior under synthetic Gaussian noise.
- Inputs: Holdout data loader, model backend, noise factor schedule, threshold, calibration bins.
- Method:
- Use an evenly spaced noise factor schedule.
- Add Gaussian noise scaled by a fixed intensity-range factor.
- Recompute performance and calibration at each noise level (sigma).
- Save visual examples of noised images.
- Main outputs:
- noise_sensitivity.csv
- plots/noise_sensitivity_accuracy.png
- plots/noise_sensitivity_f1.png
- plots/noise_confidence.png
- plots/noise_standard_deviation.png (ensemble) / plots/noise_predictive_uncertainty.png (bayesian)
- plots/noise_examples/*_noise_examples.png
- plots/noise_examples/*_clean_scan_example.png
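One plausible reading of the noise step is sketched below. It assumes that sigma is the noise factor multiplied by the image's own intensity range (max minus min); the pipeline may instead use a fixed dataset-wide range. The function name `add_gaussian_noise` and the example schedule are hypothetical.

```python
import numpy as np

def add_gaussian_noise(image, noise_factor, rng):
    """Add zero-mean Gaussian noise with sigma = noise_factor * intensity range."""
    image = np.asarray(image, dtype=np.float32)
    intensity_range = float(image.max() - image.min())
    sigma = float(noise_factor) * intensity_range
    noise = rng.normal(0.0, sigma, size=image.shape).astype(np.float32)
    return image + noise

# A fixed seed keeps the noised examples reproducible across runs.
rng = np.random.default_rng(0)
clean = np.linspace(0.0, 1.0, 16, dtype=np.float32).reshape(4, 4)

# Evenly spaced noise factor schedule; a factor of 0.0 leaves the image unchanged.
noise_factors = np.linspace(0.0, 0.5, 6)
noised = [add_gaussian_noise(clean, f, rng) for f in noise_factors]
```

Performance and calibration would then be recomputed on the model's predictions for each entry of `noised`.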
8. Dataset Composition Summary
- Purpose: Report dataset composition without rerunning full evaluation analyses.
- Inputs: Raw dataset files and configured train/validation/test split ratios.
- Method:
- Rebuild dataset and split assignment using the configured split seed.
- Count total images and positive/negative labels overall.
- Count train/validation/test images and per-split class balance.
- Compute both overall and split-level percentages.
- Main outputs:
- CLI: python -m analysis.evaluate_models --dataset-summary-only
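The counting logic can be illustrated with the sketch below. It is not the split logic from analysis/data_pipeline.py: the shuffled-index split, the default ratios, the seed value, and the names `describe` and `split_summary` are all assumptions made for the example.

```python
import numpy as np

def describe(labels, total):
    """Counts and percentages for one group of binary labels."""
    n = int(len(labels))
    pos = int(np.sum(labels == 1))
    return {
        "n": n,
        "positive": pos,
        "negative": n - pos,
        "pct_of_total": 100.0 * n / total if total else 0.0,
        "pct_positive": 100.0 * pos / n if n else 0.0,
    }

def split_summary(labels, ratios=(0.7, 0.15, 0.15), seed=42):
    """Rebuild a seeded train/validation/test split and summarize each part."""
    labels = np.asarray(labels).astype(int)
    total = len(labels)
    # The same seed always yields the same split assignment.
    order = np.random.default_rng(seed).permutation(total)
    n_train = int(round(ratios[0] * total))
    n_val = int(round(ratios[1] * total))
    parts = {
        "train": order[:n_train],
        "validation": order[n_train:n_train + n_val],
        "test": order[n_train + n_val:],
    }
    summary = {"overall": describe(labels, total)}
    for name, idx in parts.items():
        summary[name] = describe(labels[idx], total)
    return summary

summary = split_summary([1] * 40 + [0] * 60)
```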
Plot Report
- Each backend output now includes plots_report.md.
- The report embeds all generated plot images and includes a short description per plot.
Notes on Even Spacing
The pipeline uses evenly spaced grids for sampled x-axis points in threshold and uncertainty-cutoff performance plots, and for the sigma schedule used by noise sensitivity plots.