ise.evaluation

Evaluation metrics for ice sheet emulator predictions.

Covers four categories:

Point metrics — R², MSE, MAE, MAPE, relative squared error, sector-wise MSE.
Probabilistic / uncertainty metrics — CRPS (Gaussian), Winkler score, Expected Calibration Error (ECE), prediction interval width.
Distribution metrics — KL divergence, JS divergence, KS test, two-sample t-test.
Spatial aggregation — sum_by_sector for basin-level aggregation.

Submodules

ise.evaluation.metrics

Evaluation metrics for ISEFlow sea-level projections.

This module collects the full set of metrics used to assess ISEFlow prediction quality across three dimensions: point accuracy, distributional fidelity, and uncertainty calibration.

Point accuracy metrics

r2_score — coefficient of determination (R²).
mean_squared_error — MSE between predicted and true SLE.
mean_absolute_error — MAE.
mape — mean absolute percentage error (skips zero targets).
relative_squared_error — RSE = SS_res / SS_tot (0 = perfect, 1 = baseline mean).
crps — continuous ranked probability score for Gaussian forecasts (lower is better; uses properscoring library).

Distribution metrics

These compare the distribution of projected SLE values at 2100 across many runs, rather than individual timestep accuracy:

kl_divergence — Kullback-Leibler divergence between predicted and true PDFs.
js_divergence — Jensen-Shannon divergence (symmetric; bounded in [0, 1] when computed with base-2 logarithms).
kolmogorov_smirnov — KS two-sample test statistic and p-value.
t_test — two-sample t-test statistic and p-value.

Uncertainty calibration metrics

calculate_ece — Expected Calibration Error (ECE). Measures how well the predicted std aligns with actual errors by binning predictions by uncertainty level and checking what fraction of true values fall within ±2σ (expected ≈ 95.4 % for a Gaussian). Lower ECE = better calibrated.
mean_prediction_interval_width — average width of prediction intervals (sharpness proxy; should be small while ECE stays low).
winkler_score — proper scoring rule for interval forecasts at significance level α. Penalises both wide intervals and violations.

Sector aggregation

sum_by_sector — given a full 2-D (x, y) grid array and a sector-definition NetCDF file, sums values within each sector mask to produce an (N_timesteps, N_sectors) matrix:

from ise.evaluation.metrics import sum_by_sector
sector_sums = sum_by_sector(predicted_grid, "AIS_sectors_8km.nc")

Usage example

from ise.evaluation.metrics import r2_score, crps, calculate_ece

r2  = r2_score(y_true, predictions)
score = crps(y_true, predictions, uncertainties["total"])
ece = calculate_ece(predictions, uncertainties["total"], y_true)

ise.evaluation.metrics.calculate_ece(predictions, uncertainties, true_values, bins=10)

Computes the Expected Calibration Error (ECE) for a regression model.

Parameters:

predictions (numpy.ndarray) – The predicted mean values.
uncertainties (numpy.ndarray) – The predicted standard deviations.
true_values (numpy.ndarray) – The true values.
bins (int, optional) – The number of bins for uncertainty grouping. Defaults to 10.

Returns:

The Expected Calibration Error (ECE).

Return type:

float

Notes

ECE measures how well predicted uncertainties align with actual errors.
A lower ECE indicates better-calibrated uncertainty estimates.
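
The notes above describe ECE only qualitatively. As a rough illustration, and not the library's actual implementation, the sketch below follows the recipe given in the module overview: group samples into bins of predicted standard deviation, measure the fraction of true values within ±2σ in each bin, and average the gap to the Gaussian expectation (about 95.4 %). The name `ece_sketch`, the parameter `k`, and the equal-width binning are assumptions made for this example.

```python
import math


def ece_sketch(predictions, uncertainties, true_values, bins=10, k=2.0):
    """Illustrative ECE: bin by predicted std, compare +/- k*sigma coverage
    in each bin with the coverage a Gaussian would give."""
    expected = math.erf(k / math.sqrt(2.0))  # about 0.9545 for k = 2
    lo, hi = min(uncertainties), max(uncertainties)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all stds are equal
    counts = [0] * bins
    hits = [0] * bins
    for mu, sigma, y in zip(predictions, uncertainties, true_values):
        idx = min(int((sigma - lo) / width), bins - 1)  # equal-width bin index
        counts[idx] += 1
        hits[idx] += abs(y - mu) <= k * sigma  # True counts as 1
    n = len(predictions)
    # Weighted average of |observed coverage - expected coverage| over non-empty bins.
    return sum(
        (c / n) * abs(h / c - expected)
        for c, h in zip(counts, hits)
        if c > 0
    )
```

Under this definition, intervals that cover every true value are still penalised slightly, because 100 % coverage exceeds the 95.4 % a calibrated Gaussian would produce.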

ise.evaluation.metrics.crps(y_true, y_pred, y_std)

Computes the Continuous Ranked Probability Score (CRPS) for a Gaussian distribution.

Parameters:

y_true (numpy.ndarray) – The true values.
y_pred (numpy.ndarray) – The predicted mean values.
y_std (numpy.ndarray) – The predicted standard deviations.

Returns:

The computed CRPS values for each prediction.

Return type:

numpy.ndarray
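
For a Gaussian forecast the CRPS has a well-known closed form, CRPS = σ [ z (2Φ(z) − 1) + 2φ(z) − 1/√π ] with z = (y − μ)/σ. The overview states that the library relies on properscoring; the sketch below is an independent, dependency-free illustration of that closed form, and the name `gaussian_crps_sketch` is hypothetical.

```python
import math


def gaussian_crps_sketch(y_true, y_pred, y_std):
    """Per-sample CRPS of Gaussian forecasts N(y_pred, y_std**2), closed form."""
    scores = []
    for y, mu, sigma in zip(y_true, y_pred, y_std):
        z = (y - mu) / sigma
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
        scores.append(
            sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
        )
    return scores
```

As with the documented function, one score is returned per prediction; averaging them gives a single summary value.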

ise.evaluation.metrics.js_divergence(p: ndarray, q: ndarray)

Computes the Jensen-Shannon Divergence (JSD) between two probability distributions.

Parameters:

p (numpy.ndarray) – The first probability distribution.
q (numpy.ndarray) – The second probability distribution.

Returns:

The Jensen-Shannon divergence value.

Return type:

float

Notes

JSD is a smoothed and symmetric version of KL divergence.
The function normalizes the distributions before computation.

ise.evaluation.metrics.kl_divergence(p: ndarray, q: ndarray)

Computes the Kullback-Leibler (KL) Divergence between two probability distributions.

Parameters:

p (numpy.ndarray) – The first probability distribution.
q (numpy.ndarray) – The second probability distribution.

Returns:

The KL divergence value.

Return type:

float

Notes

The distributions p and q must be normalized (i.e., sum to 1).
Small epsilon values are used to avoid numerical instability.
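
To make the relationship between the KL and JS entries concrete, here is a minimal sketch of both on discrete distributions. It is an assumption-laden illustration, not the library's code: the helper names, the epsilon value of 1e-12, and the choice of base-2 scaling for JS (which gives the [0, 1] bound) are all chosen for this example.

```python
import math


def normalize(p):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(p)
    return [v / total for v in p]


def kl_sketch(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) and division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def js_sketch(p, q, eps=1e-12):
    """Jensen-Shannon divergence in bits: symmetric, between 0 and 1."""
    p, q = normalize(p), normalize(q)
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * (kl_sketch(p, m, eps) + kl_sketch(q, m, eps)) / math.log(2.0)
```

KL is asymmetric, so `kl_sketch(p, q)` and `kl_sketch(q, p)` generally differ, whereas `js_sketch` gives the same value in either order.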

ise.evaluation.metrics.kolmogorov_smirnov(x1, x2)

Computes the Kolmogorov-Smirnov (KS) statistic to compare two distributions.

Parameters:

x1 (numpy.ndarray or list) – The first dataset.
x2 (numpy.ndarray or list) – The second dataset.

Returns:

(KS statistic, p-value).

Return type:

tuple
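
The KS statistic itself is the largest vertical gap between the two empirical CDFs. The sketch below computes only that statistic, to show what the first element of the returned tuple represents; it does not compute a p-value, and the name `ks_statistic_sketch` is hypothetical rather than part of the package.

```python
import bisect


def ks_statistic_sketch(x1, x2):
    """Two-sample KS statistic: max |ECDF1(v) - ECDF2(v)| over all observed v."""
    a, b = sorted(x1), sorted(x2)
    largest_gap = 0.0
    for v in sorted(set(a) | set(b)):
        f1 = bisect.bisect_right(a, v) / len(a)  # fraction of x1 <= v
        f2 = bisect.bisect_right(b, v) / len(b)  # fraction of x2 <= v
        largest_gap = max(largest_gap, abs(f1 - f2))
    return largest_gap
```

A value near 0 means the two samples have nearly identical empirical distributions; a value of 1 means they do not overlap at all.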

ise.evaluation.metrics.mape(y_true, y_pred)

Computes the Mean Absolute Percentage Error (MAPE).

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The MAPE value, expressed as a percentage.

Return type:

float

Notes

MAPE ignores zero values in y_true to prevent division by zero.
If all true values are zero, returns infinity.
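
The zero-handling described in the notes can be sketched as follows. This is an illustration of the documented behaviour only; `mape_sketch` is a hypothetical name and the real function may differ in detail.

```python
def mape_sketch(y_true, y_pred):
    """MAPE in percent, skipping samples whose true value is zero."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        return float("inf")  # every target was zero
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)
```

Because zero targets are dropped, the average is taken over the remaining samples only, so a dataset with many zeros is scored on a smaller effective sample.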

ise.evaluation.metrics.mean_absolute_error(y_true, y_pred)

Computes the Mean Absolute Error (MAE).

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The Mean Absolute Error (MAE).

Return type:

float

ise.evaluation.metrics.mean_prediction_interval_width(upper_bound, lower_bound)

Computes the Mean Prediction Interval Width (MPIW).

Parameters:

upper_bound (numpy.ndarray or list) – The upper bounds of the prediction intervals.
lower_bound (numpy.ndarray or list) – The lower bounds of the prediction intervals.

Returns:

The Mean Prediction Interval Width (MPIW).

Return type:

float

ise.evaluation.metrics.mean_squared_error(y_true, y_pred)

Computes the Mean Squared Error (MSE).

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The Mean Squared Error (MSE).

Return type:

float

ise.evaluation.metrics.mean_squared_error_sector(sum_sectors_true, sum_sectors_pred)

Computes the mean squared error (MSE) between true and predicted sector-wise sums.

Parameters:

sum_sectors_true (numpy.ndarray) – The true summed sector values.
sum_sectors_pred (numpy.ndarray) – The predicted summed sector values.

Returns:

The mean squared error (MSE).

Return type:

float

ise.evaluation.metrics.r2_score(y_true, y_pred)

Computes the coefficient of determination (R² score).

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The R² score, where 1 indicates perfect predictions.

Return type:

float

ise.evaluation.metrics.relative_squared_error(y_true, y_pred)

Computes the Relative Squared Error (RSE), measuring the error relative to the variance in y_true.

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The computed RSE value.

Return type:

float

Notes

A lower RSE indicates better performance, with RSE=0 indicating perfect predictions.
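
The module overview defines RSE as SS_res / SS_tot, which also means R² = 1 − RSE. A minimal sketch of that definition, with the hypothetical name `rse_sketch` and no handling of the constant-target case (SS_tot = 0):

```python
def rse_sketch(y_true, y_pred):
    """Relative squared error: residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return ss_res / ss_tot
```

Predicting the mean of `y_true` for every sample gives an RSE of exactly 1, which is the baseline mentioned in the overview; values above 1 are worse than that baseline.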

ise.evaluation.metrics.sum_by_sector(array, grid_file)

Computes the sum of values in a given array by predefined sectors using a grid file.

Parameters:

array (numpy.ndarray or torch.Tensor) – A 2D or 3D array containing values to be summed by sector.
grid_file (str or xarray.Dataset) – Path to the grid file defining sector boundaries or an xarray dataset.

Returns:

A 2D array where each row represents a timestep and each column represents a sector.

Return type:

numpy.ndarray

Raises:

ValueError – If grid_file is not a valid string or xarray dataset.
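
The aggregation idea can be illustrated without NetCDF or xarray. The sketch below is not the library's implementation: it assumes the sector definition has already been loaded as a 2-D grid of integer labels, with 0 meaning "outside every sector", and it uses nested lists to stay dependency-free. How the real grid file encodes sectors is not stated here, so both the label convention and the name `sum_by_sector_sketch` are assumptions.

```python
def sum_by_sector_sketch(array, sector_ids):
    """Sum grid values per sector for each timestep.

    array:      nested list indexed [time][y][x]
    sector_ids: nested list indexed [y][x] of integer labels (0 = no sector)
    returns:    nested list indexed [time][sector], sectors in ascending label order
    """
    labels = sorted({s for row in sector_ids for s in row if s != 0})
    column = {label: i for i, label in enumerate(labels)}
    result = []
    for frame in array:
        sums = [0.0] * len(labels)
        for value_row, id_row in zip(frame, sector_ids):
            for value, sector in zip(value_row, id_row):
                if sector != 0:
                    sums[column[sector]] += value
        result.append(sums)
    return result
```

The output has one row per timestep and one column per sector, matching the shape described in the Returns entry above.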

ise.evaluation.metrics.t_test(x1, x2)

Performs an independent two-sample t-test to compare the means of two distributions.

Parameters:

x1 (numpy.ndarray or list) – The first dataset.
x2 (numpy.ndarray or list) – The second dataset.

Returns:

(t-statistic, p-value).

Return type:

tuple

ise.evaluation.metrics.winkler_score(y_true, y_pred, lower_bound, upper_bound, alpha=0.05)

Computes the Winkler Score for prediction intervals.

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted mean values.
lower_bound (numpy.ndarray or list) – The lower bounds of the prediction intervals.
upper_bound (numpy.ndarray or list) – The upper bounds of the prediction intervals.
alpha (float, optional) – The significance level for the prediction intervals. Defaults to 0.05.

Returns:

The Winkler Score.

Return type:

float
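
For reference, the textbook Winkler score of a (1 − α) interval is the interval width plus a penalty of 2/α times the distance by which the true value falls outside the interval. The sketch below averages that score over samples. It is an assumption that the library follows this standard form; the sketch omits `y_pred` because the textbook formula does not use it, and `winkler_sketch` is a hypothetical name.

```python
def winkler_sketch(y_true, lower_bound, upper_bound, alpha=0.05):
    """Mean Winkler score: interval width plus a (2 / alpha)-scaled miss penalty."""
    scores = []
    for y, lo, hi in zip(y_true, lower_bound, upper_bound):
        score = hi - lo                       # width term (sharpness)
        if y < lo:
            score += (2.0 / alpha) * (lo - y)  # true value below the interval
        elif y > hi:
            score += (2.0 / alpha) * (y - hi)  # true value above the interval
        scores.append(score)
    return sum(scores) / len(scores)
```

When every true value lies inside its interval, the penalty terms vanish and the result equals the mean prediction interval width.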

Module contents

Evaluation metrics for ice sheet emulator predictions.

This package provides metrics (e.g. R², MSE, CRPS, ECE, sector-wise sums) for assessing point predictions and uncertainty quantification.

ise.evaluation.calculate_ece(predictions, uncertainties, true_values, bins=10)

Computes the Expected Calibration Error (ECE) for a regression model.

Parameters:

predictions (numpy.ndarray) – The predicted mean values.
uncertainties (numpy.ndarray) – The predicted standard deviations.
true_values (numpy.ndarray) – The true values.
bins (int, optional) – The number of bins for uncertainty grouping. Defaults to 10.

Returns:

The Expected Calibration Error (ECE).

Return type:

float

Notes

ECE measures how well predicted uncertainties align with actual errors.
A lower ECE indicates better-calibrated uncertainty estimates.

ise.evaluation.crps(y_true, y_pred, y_std)

Computes the Continuous Ranked Probability Score (CRPS) for a Gaussian distribution.

Parameters:

y_true (numpy.ndarray) – The true values.
y_pred (numpy.ndarray) – The predicted mean values.
y_std (numpy.ndarray) – The predicted standard deviations.

Returns:

The computed CRPS values for each prediction.

Return type:

numpy.ndarray

ise.evaluation.js_divergence(p: ndarray, q: ndarray)

Computes the Jensen-Shannon Divergence (JSD) between two probability distributions.

Parameters:

p (numpy.ndarray) – The first probability distribution.
q (numpy.ndarray) – The second probability distribution.

Returns:

The Jensen-Shannon divergence value.

Return type:

float

Notes

JSD is a smoothed and symmetric version of KL divergence.
The function normalizes the distributions before computation.

ise.evaluation.kl_divergence(p: ndarray, q: ndarray)

Computes the Kullback-Leibler (KL) Divergence between two probability distributions.

Parameters:

p (numpy.ndarray) – The first probability distribution.
q (numpy.ndarray) – The second probability distribution.

Returns:

The KL divergence value.

Return type:

float

Notes

The distributions p and q must be normalized (i.e., sum to 1).
Small epsilon values are used to avoid numerical instability.

ise.evaluation.kolmogorov_smirnov(x1, x2)

Computes the Kolmogorov-Smirnov (KS) statistic to compare two distributions.

Parameters:

x1 (numpy.ndarray or list) – The first dataset.
x2 (numpy.ndarray or list) – The second dataset.

Returns:

(KS statistic, p-value).

Return type:

tuple

ise.evaluation.mape(y_true, y_pred)

Computes the Mean Absolute Percentage Error (MAPE).

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The MAPE value, expressed as a percentage.

Return type:

float

Notes

MAPE ignores zero values in y_true to prevent division by zero.
If all true values are zero, returns infinity.

ise.evaluation.mean_absolute_error(y_true, y_pred)

Computes the Mean Absolute Error (MAE).

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The Mean Absolute Error (MAE).

Return type:

float

ise.evaluation.mean_prediction_interval_width(upper_bound, lower_bound)

Computes the Mean Prediction Interval Width (MPIW).

Parameters:

upper_bound (numpy.ndarray or list) – The upper bounds of the prediction intervals.
lower_bound (numpy.ndarray or list) – The lower bounds of the prediction intervals.

Returns:

The Mean Prediction Interval Width (MPIW).

Return type:

float

ise.evaluation.mean_squared_error(y_true, y_pred)

Computes the Mean Squared Error (MSE).

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The Mean Squared Error (MSE).

Return type:

float

ise.evaluation.mean_squared_error_sector(sum_sectors_true, sum_sectors_pred)

Computes the mean squared error (MSE) between true and predicted sector-wise sums.

Parameters:

sum_sectors_true (numpy.ndarray) – The true summed sector values.
sum_sectors_pred (numpy.ndarray) – The predicted summed sector values.

Returns:

The mean squared error (MSE).

Return type:

float

ise.evaluation.r2_score(y_true, y_pred)

Computes the coefficient of determination (R² score).

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The R² score, where 1 indicates perfect predictions.

Return type:

float

ise.evaluation.relative_squared_error(y_true, y_pred)

Computes the Relative Squared Error (RSE), measuring the error relative to the variance in y_true.

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted values.

Returns:

The computed RSE value.

Return type:

float

Notes

A lower RSE indicates better performance, with RSE=0 indicating perfect predictions.

ise.evaluation.sum_by_sector(array, grid_file)

Computes the sum of values in a given array by predefined sectors using a grid file.

Parameters:

array (numpy.ndarray or torch.Tensor) – A 2D or 3D array containing values to be summed by sector.
grid_file (str or xarray.Dataset) – Path to the grid file defining sector boundaries or an xarray dataset.

Returns:

A 2D array where each row represents a timestep and each column represents a sector.

Return type:

numpy.ndarray

Raises:

ValueError – If grid_file is not a valid string or xarray dataset.

ise.evaluation.t_test(x1, x2)

Performs an independent two-sample t-test to compare the means of two distributions.

Parameters:

x1 (numpy.ndarray or list) – The first dataset.
x2 (numpy.ndarray or list) – The second dataset.

Returns:

(t-statistic, p-value).

Return type:

tuple

ise.evaluation.winkler_score(y_true, y_pred, lower_bound, upper_bound, alpha=0.05)

Computes the Winkler Score for prediction intervals.

Parameters:

y_true (numpy.ndarray or list) – The true values.
y_pred (numpy.ndarray or list) – The predicted mean values.
lower_bound (numpy.ndarray or list) – The lower bounds of the prediction intervals.
upper_bound (numpy.ndarray or list) – The upper bounds of the prediction intervals.
alpha (float, optional) – The significance level for the prediction intervals. Defaults to 0.05.

Returns:

The Winkler Score.

Return type:

float