Analysis Overview

This document summarizes what each analysis in this folder does and what it produces.

Runtime Defaults

  • Analysis defaults are centralized in analysis/defaults.py.
  • The orchestrator CLI in analysis/evaluate_models.py is intentionally minimal and focuses on operational controls (--backend, --run-name, --skip-noise).
  • Standalone operational modes include --longitudinal-breakdown-only, --noise-correlation-only, and --dataset-summary-only.
  • Threshold grids, uncertainty percentile grids, noise-factor grids, calibration bins, class index, decision threshold, and Bayesian Monte Carlo (MC) passes are sourced from analysis/defaults.py.

Shared Utilities

  • Shared loader/split logic is centralized in analysis/data_pipeline.py.
  • All plotting code is centralized in analysis/plotting.py for easier inspection and maintenance.

1. Performance Threshold Sweep

  • Purpose: Measure classification performance as decision threshold changes.
  • Inputs: Ground-truth labels and predicted probabilities.
  • Method: Evaluate metrics across an evenly spaced threshold grid.
  • Main outputs:
    • performance_threshold_sweep.csv
    • plots/performance_threshold_accuracy.png
    • plots/performance_threshold_f1.png
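
The sweep can be sketched in plain Python as follows. This is a minimal illustration, not the code in analysis/evaluate_models.py: the helper names (evenly_spaced, sweep_thresholds), the row keys, and the choice to treat a probability at or above the threshold as positive are all assumptions made for the example.

```python
def evenly_spaced(start, stop, num):
    """Return `num` evenly spaced values from start to stop, inclusive."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def sweep_thresholds(labels, probs, thresholds):
    """Compute accuracy and F1 at each decision threshold (hypothetical helper)."""
    rows = []
    for t in thresholds:
        # Assumption: a probability at or above the threshold counts as positive.
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 1)
        fp = sum(1 for y, yh in zip(labels, preds) if y == 0 and yh == 1)
        fn = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 0)
        tn = len(labels) - tp - fp - fn
        denom = 2 * tp + fp + fn
        rows.append({
            "threshold": t,
            "accuracy": (tp + tn) / len(labels),
            "f1": 2 * tp / denom if denom else 0.0,  # F1 defined as 0 when undefined
        })
    return rows


# Toy inputs, invented for illustration only.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
rows = sweep_thresholds(labels, probs, evenly_spaced(0.0, 1.0, 5))
```

Each entry in rows corresponds to one row of a table like performance_threshold_sweep.csv, although the real column names may differ.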

2. Uncertainty Cutoff (Raw-Value)

  • Purpose: Evaluate performance on subsets whose uncertainty is at or below percentile-derived cutoff values.
  • Inputs: Uncertainty arrays (confidence-derived uncertainty and backend-specific uncertainty).
  • Method:
    • Build evenly spaced percentile points.
    • Convert each percentile to a raw uncertainty cutoff value.
    • Keep samples where uncertainty is less than or equal to that cutoff.
    • Compute accuracy and F1 for each retained subset.
  • Main outputs:
    • performance_uncertainty_cutoff.csv
    • plots/performance_uncertainty_cutoff_accuracy.png
    • plots/performance_uncertainty_cutoff_f1.png
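
A minimal sketch of the percentile-to-cutoff step, assuming linear interpolation between sorted values (the pipeline may use a different percentile convention). The function names and row keys are hypothetical, and only accuracy is computed to keep the example short; F1 would be computed on the same retained subset.

```python
def percentile(values, q):
    """Linear-interpolation percentile of `values` for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def cutoff_sweep(labels, preds, uncertainty, percentiles):
    """Accuracy on the samples whose uncertainty is <= each raw cutoff."""
    rows = []
    for q in percentiles:
        cutoff = percentile(uncertainty, q)
        kept = [i for i, u in enumerate(uncertainty) if u <= cutoff]
        correct = sum(1 for i in kept if labels[i] == preds[i])
        rows.append({
            "percentile": q,
            "cutoff": cutoff,
            "n_kept": len(kept),
            "accuracy": correct / len(kept),  # the minimum always survives, so kept is non-empty
        })
    return rows


# Toy inputs, invented for illustration only.
labels = [1, 0, 1, 0, 1]
preds = [1, 0, 0, 0, 0]
uncertainty = [0.05, 0.10, 0.40, 0.20, 0.50]
rows = cutoff_sweep(labels, preds, uncertainty, [25, 50, 75, 100])
```

In this toy data the two misclassified samples carry the highest uncertainty, so accuracy falls as the cutoff loosens and more samples are retained.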

3. Uncertainty Cutoff (Percentile-Ranked)

  • Purpose: Evaluate how performance changes as the evaluated set shrinks from all samples to only the lowest-uncertainty samples.
  • Inputs: The same uncertainty arrays as in analysis 2.
  • Method:
    • Use an evenly spaced percentile grid.
    • Keep samples where uncertainty is less than or equal to the selected percentile cutoff.
    • Plot from least restricted on the left (all samples) to most restricted on the right (lowest-uncertainty subset only).
  • Main outputs:
    • performance_uncertainty_percentile_cutoff.csv
    • plots/performance_uncertainty_percentile_cutoff_accuracy.png
    • plots/performance_uncertainty_percentile_cutoff_f1.png

4. Calibration Analysis

  • Purpose: Quantify probability calibration quality.
  • Inputs: Ground-truth labels and predicted probabilities.
  • Method:
    • Reliability binning with configurable bin count.
    • Compute MCE and Brier score.
  • Main outputs:
    • calibration_bins.csv
    • plots/calibration_reliability.png
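
The binning and the two scores can be sketched as below. This assumes equal-width probability bins and reads MCE as maximum calibration error, i.e. the largest gap between mean predicted probability and observed positive rate over non-empty bins; both are assumptions, as are the function names and row keys.

```python
def reliability_bins(labels, probs, n_bins):
    """Group predictions into equal-width probability bins (hypothetical helper)."""
    bins = [{"count": 0, "prob_sum": 0.0, "pos_sum": 0} for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx]["count"] += 1
        bins[idx]["prob_sum"] += p
        bins[idx]["pos_sum"] += y
    rows = []
    for i, b in enumerate(bins):
        if b["count"] == 0:
            continue  # empty bins carry no calibration information
        rows.append({
            "bin": i,
            "count": b["count"],
            "mean_prob": b["prob_sum"] / b["count"],
            "frac_positive": b["pos_sum"] / b["count"],
        })
    return rows


def max_calibration_error(rows):
    """Largest per-bin gap between predicted and observed positive rate."""
    return max(abs(r["mean_prob"] - r["frac_positive"]) for r in rows)


def brier_score(labels, probs):
    """Mean squared difference between probability and binary label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Toy inputs, invented for illustration only.
labels = [0, 0, 1, 1]
probs = [0.1, 0.3, 0.7, 0.9]
bin_rows = reliability_bins(labels, probs, n_bins=2)
mce = max_calibration_error(bin_rows)
brier = brier_score(labels, probs)
```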

5. Physician Confidence Comparison

  • Purpose: Compare model uncertainty with physician confidence ratings.
  • Inputs: Evaluation outputs plus a clinical table (Image Data ID + physician confidence column).
  • Method:
    • Merge by image ID.
    • Group metrics by physician confidence level.
    • Plot distributions per rating group.
  • Main outputs include grouped summary CSV files and boxplots for confidence and standard deviation (ensemble) or predictive uncertainty (bayesian).
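
The merge-and-group step might look like the following sketch. Only the "Image Data ID" key comes from this document; the physician_confidence and uncertainty field names, the function name, and the decision to drop images without a rating (an inner join) are assumptions.

```python
from collections import defaultdict
from statistics import mean


def group_by_physician_confidence(eval_rows, clinical_rows,
                                  id_key="Image Data ID",
                                  rating_key="physician_confidence",
                                  value_key="uncertainty"):
    """Join evaluation rows to ratings by image ID, then summarize per rating."""
    ratings = {row[id_key]: row[rating_key] for row in clinical_rows}
    groups = defaultdict(list)
    for row in eval_rows:
        rating = ratings.get(row[id_key])
        if rating is not None:  # assumption: images without a rating are dropped
            groups[rating].append(row[value_key])
    return {
        rating: {"n": len(values), "mean": mean(values)}
        for rating, values in sorted(groups.items())
    }


# Toy rows, invented for illustration only.
eval_rows = [
    {"Image Data ID": "I1", "uncertainty": 0.10},
    {"Image Data ID": "I2", "uncertainty": 0.30},
    {"Image Data ID": "I3", "uncertainty": 0.50},
    {"Image Data ID": "I4", "uncertainty": 0.20},  # no physician rating available
]
clinical_rows = [
    {"Image Data ID": "I1", "physician_confidence": "high"},
    {"Image Data ID": "I2", "physician_confidence": "high"},
    {"Image Data ID": "I3", "physician_confidence": "low"},
]
summary = group_by_physician_confidence(eval_rows, clinical_rows)
```

The per-rating value lists collected in groups are also what a boxplot per rating group would be drawn from.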

6. Longitudinal Stability Analysis

  • Purpose: Examine uncertainty patterns across stable and transitioning patient trajectories.
  • Inputs: Evaluation outputs and clinical timeline information.
  • Method:
    • Build patient-level trajectories.
    • Group by clinical cohort dynamics.
    • Compare uncertainty summaries across cohorts.
  • Main outputs include patient/cohort summary CSV files and cohort uncertainty plots.
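
As a rough illustration of the trajectory and cohort steps, the sketch below labels a patient "stable" when the diagnosis never changes across visits and "transitioning" otherwise. That cohort rule, the field names (patient_id, visit_date, diagnosis, uncertainty), and the function name are assumptions; the actual cohort definitions in the pipeline may be finer-grained.

```python
from collections import defaultdict
from statistics import mean


def cohort_uncertainty(visits):
    """Mean per-patient uncertainty, averaged within each cohort."""
    by_patient = defaultdict(list)
    for visit in visits:
        by_patient[visit["patient_id"]].append(visit)

    cohorts = defaultdict(list)
    for patient_visits in by_patient.values():
        # Order visits in time to form the patient's trajectory.
        ordered = sorted(patient_visits, key=lambda v: v["visit_date"])
        trajectory = [v["diagnosis"] for v in ordered]
        # Assumed rule: any change in diagnosis makes the patient "transitioning".
        cohort = "stable" if len(set(trajectory)) == 1 else "transitioning"
        cohorts[cohort].append(mean(v["uncertainty"] for v in ordered))
    return {name: mean(values) for name, values in cohorts.items()}


# Toy visits, invented for illustration only.
visits = [
    {"patient_id": "A", "visit_date": "2020-01-01", "diagnosis": "negative", "uncertainty": 0.1},
    {"patient_id": "A", "visit_date": "2021-01-01", "diagnosis": "negative", "uncertainty": 0.3},
    {"patient_id": "B", "visit_date": "2020-06-01", "diagnosis": "negative", "uncertainty": 0.4},
    {"patient_id": "B", "visit_date": "2021-06-01", "diagnosis": "positive", "uncertainty": 0.6},
]
cohort_means = cohort_uncertainty(visits)
```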

7. Noise Sensitivity Analysis

  • Purpose: Test robustness and uncertainty behavior under synthetic Gaussian noise.
  • Inputs: Holdout data loader, model backend, noise factor schedule, threshold, calibration bins.
  • Method:
    • Use an evenly spaced noise factor schedule.
    • Add Gaussian noise scaled by a fixed intensity-range factor.
    • Recompute performance and calibration at each noise level (sigma).
    • Save visual examples of noised images.
  • Main outputs:
    • noise_sensitivity.csv
    • plots/noise_sensitivity_accuracy.png
    • plots/noise_sensitivity_f1.png
    • plots/noise_confidence.png
    • plots/noise_standard_deviation.png (ensemble) / plots/noise_predictive_uncertainty.png (bayesian)
    • plots/noise_examples/*_noise_examples.png
    • plots/noise_examples/*_clean_scan_example.png
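
One plausible reading of the noise step is sketched below: sigma is the noise factor multiplied by the image's intensity range (max minus min). That interpretation, the function name, and the example schedule are assumptions, and the sketch works on a flat list of pixel values rather than real image tensors.

```python
import random


def add_gaussian_noise(pixels, noise_factor, rng):
    """Add zero-mean Gaussian noise with sigma = noise_factor * intensity range."""
    intensity_range = max(pixels) - min(pixels)
    sigma = noise_factor * intensity_range  # assumed scaling rule
    # No clipping is applied here; a real pipeline may clamp to the valid range.
    return [p + rng.gauss(0.0, sigma) for p in pixels]


# Toy image and schedule, invented for illustration only.
pixels = [0.0, 0.25, 0.5, 1.0]
noise_factors = [0.0, 0.1, 0.2]  # evenly spaced schedule
rng = random.Random(0)           # seeded so repeated runs match
noised = {factor: add_gaussian_noise(pixels, factor, rng) for factor in noise_factors}
```

A noise factor of 0 leaves the image unchanged, which gives a clean baseline row to compare against the noised rows.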

8. Dataset Composition Summary

  • Purpose: Report dataset composition without rerunning full evaluation analyses.
  • Inputs: Raw dataset files and configured train/validation/test split ratios.
  • Method:
    • Rebuild dataset and split assignment using the configured split seed.
    • Count total images and positive/negative labels overall.
    • Count train/validation/test images and per-split class balance.
    • Compute both overall and split-level percentages.
  • Main outputs:
    • dataset_summary.md
  • CLI:
    • python -m analysis.evaluate_models --dataset-summary-only
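
The counting logic can be sketched as follows, assuming the split is a seeded shuffle cut at the configured ratios. The real split logic lives in analysis/data_pipeline.py and may differ (for example, it could split by patient or stratify by class); the function name and summary keys are hypothetical.

```python
import random


def split_summary(labels, ratios, seed):
    """Assign samples to train/validation/test and count classes per split."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)  # same seed gives the same assignment
    n = len(indices)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    splits = {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],  # remainder goes to test
    }
    summary = {}
    for name, idx in splits.items():
        positives = sum(labels[i] for i in idx)
        summary[name] = {
            "n": len(idx),
            "positive": positives,
            "negative": len(idx) - positives,
            "pct_of_total": 100.0 * len(idx) / n,
        }
    return summary


# Toy labels and ratios, invented for illustration only.
labels = [1] * 4 + [0] * 6
summary = split_summary(labels, ratios=(0.6, 0.2, 0.2), seed=42)
```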

Plot Report

  • Each backend's output includes plots_report.md.
  • The report embeds all generated plot images and includes a short description per plot.

Notes on Even Spacing

The pipeline uses evenly spaced grids for sampled x-axis points in threshold and uncertainty-cutoff performance plots, and for the sigma schedule used by noise sensitivity plots.