ise.utils

Utility functions for data loading, tensor conversion, file I/O, and package path helpers.

functions — get_X_y, get_data, to_tensor, unscale_output, unscale_input, get_all_filepaths, load_model, and helpers for uncertainty bands, KDE distributions, and dataset combination.
io — check_type runtime type validation.
ismip6_model_configs_path — absolute path to the bundled ISM configuration JSON used by ISEFlowAISInputs and FeatureEngineer.

Submodules

ise.utils.functions

General-purpose utilities for the ISE package.

This module collects helper functions used across the codebase for tensor conversion, file discovery, data loading, scaling, and training-data extraction.

Tensor utilities

to_tensor(x): Convert a pd.DataFrame, np.ndarray, or torch.Tensor to a torch.float32 tensor. Used pervasively in model fit() / predict() methods to accept any of the three common input types without requiring callers to pre-convert.

File utilities

get_all_filepaths(path, filetype, contains, not_contains): Recursively walk path and return all file paths matching the given extension and substring filters. Used by process_sectors() and DatasetMerger to discover NetCDF and CSV files in the GHub directory.

Training data helpers

get_X_y(data, dataset_type, return_format, cols, with_chars)

Split a merged dataset CSV into feature matrix X and target vector y, dropping metadata columns (exp, model, aogcm, etc.) and optionally dropping ISM characteristic columns. Supports 'numpy', 'tensor', and 'pandas' return formats:

from ise.utils.functions import get_X_y, get_data

X, y = get_X_y("splits/train.csv", dataset_type="sectors",
                return_format="tensor")

get_data(data_dir, dataset_type, return_format)

Convenience wrapper that calls get_X_y on train.csv, val.csv, and test.csv in a single call:

X_train, y_train, X_val, y_val, X_test, y_test = get_data("splits/")

Scaling / unscaling

unscale_output(y, scaler_path): inverse-transform y with a saved sklearn scaler.
unscale_input(X, scaler_path): inverse-transform X features (DataFrame only).
unscale_column(dataset, column, ice_sheet): revert MinMax scaling on the year and sectors columns to their original ranges (2015-2100 for year; 1-18 for AIS sectors or 1-6 for GrIS).

Post-processing

combine_testing_results(...): join test predictions with true values and metadata into a single DataFrame for analysis.
get_uncertainty_bands(data, confidence, quantiles): compute mean, std, confidence intervals, and quantile bands across an ensemble of projection arrays.

Misc

check_input(input, options, argname): validate a string argument against an allowed list.
undummify(df, prefix_sep): reverse pd.get_dummies encoding.
create_distribution(dataset, ...): build a KDE probability density function.
calculate_distribution_metrics(...): compute KL and JS divergence between true and predicted distributions.

ise.utils.functions.add_variable_to_nc(source_file_path, target_file_path, variable_name)

Copies a variable from a source NetCDF file to a target NetCDF file.

Parameters:

source_file_path (str) – Path to the source NetCDF file.
target_file_path (str) – Path to the target NetCDF file.
variable_name (str) – Name of the variable to be copied.

Raises:

FileNotFoundError – If the specified variable is not found in the source file.

ise.utils.functions.calculate_distribution_metrics(dataset: DataFrame, column: str | None = None, condition: str | None = None)

Computes distribution divergence metrics between true and predicted values.

This function groups the dataset by simulation runs, creates probability distributions for true and predicted values, and calculates the Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences.

Parameters:

dataset (pd.DataFrame) – The dataset containing true and predicted values.
column (str, optional) – Column name to subset on. Defaults to None.
condition (str, optional) – Value to filter the dataset based on the specified column. Defaults to None.

Returns:

A dictionary containing:

‘kl’ (float): KL-Divergence value.
‘js’ (float): Jensen-Shannon Divergence value.

Return type:

dict
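
The two divergences named above can be illustrated on small discrete distributions. This is a minimal sketch, not the package's implementation: the helper names kl_divergence and js_divergence, the eps guard, and the direction KL(true || predicted) are assumptions made for the example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    # eps guards against log(0) when q has empty bins (an assumption of this sketch).
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


true_dist = [0.5, 0.5]
pred_dist = [0.9, 0.1]

metrics = {
    "kl": kl_divergence(true_dist, pred_dist),
    "js": js_divergence(true_dist, pred_dist),
}

# Two distributions with no overlap reach the JS upper bound of ln(2).
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

KL is asymmetric and unbounded, while JS is symmetric and bounded, which is one reason to report both.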

ise.utils.functions.check_input(input: str, options: list[str], argname: str | None = None)

Validates whether a given input string is within an expected list of options.

Parameters:

input (str) – The input value to validate.
options (List[str]) – A list of valid options.
argname (str, optional) – Name of the argument being checked for better error messaging. Defaults to None.

Raises:

ValueError – If the input is not in the list of allowed options.
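
The validation pattern described here can be sketched in a few lines. The function name validate_choice and the wording of the error message are illustrative assumptions, not the package's own.

```python
def validate_choice(value, options, argname=None):
    """Raise ValueError unless `value` is one of `options`."""
    if value not in options:
        label = argname if argname is not None else "input"
        raise ValueError(f"{label} must be one of {options}, received {value!r}")


# A valid value passes silently.
validate_choice("sectors", ["sectors", "regions", "scenarios"], "dataset_type")

# An invalid value raises, and the argument name appears in the message.
try:
    validate_choice("basins", ["sectors", "regions", "scenarios"], "dataset_type")
except ValueError as err:
    message = str(err)
```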

ise.utils.functions.combine_testing_results(data_directory: str, preds: ndarray, sd: ndarray | DataFrame | None = None, gp_data: dict | None = None, time_series: bool = True, save_directory: str | None = None)

Combines test results into a DataFrame with predictions, uncertainties, and true values.

Parameters:

data_directory (str) – Directory containing training and testing data.
preds (np.ndarray | pd.Series | str) – Predictions array, Series, or path to a CSV file with predictions.
sd (dict | pd.DataFrame, optional) – Standard deviations for uncertainty estimation. Defaults to None.
gp_data (dict | pd.DataFrame, optional) – Gaussian process predictions and standard deviations. Defaults to None.
time_series (bool, optional) – Whether to process time-series data. Defaults to True.
save_directory (str, optional) – Directory where results should be saved. Defaults to None.

Returns:

DataFrame containing test results with true values, predictions, errors, and uncertainty bounds.

Return type:

pd.DataFrame

ise.utils.functions.create_distribution(dataset: ndarray, min_range=-30, max_range=20, step=0.01)

Creates a probability density function (PDF) using Gaussian kernel density estimation (KDE).

Parameters:

dataset (np.ndarray) – Input data for KDE.
min_range (float, optional) – Minimum range for support values. Defaults to -30.
max_range (float, optional) – Maximum range for support values. Defaults to 20.
step (float, optional) – Step size for the support values. Defaults to 0.01.

Returns:

A tuple containing:

density (np.ndarray): Density values from KDE.
support (np.ndarray): Support values for the density function.

Return type:

tuple
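
To make the (density, support) return value concrete, here is a hand-rolled Gaussian KDE in NumPy. It is a sketch under stated assumptions: the name kde_density, the bandwidth parameter, and the normal-reference bandwidth rule (1.06 * sd * n^(-1/5)) are choices made for this example; the package may use a library KDE with a different bandwidth.

```python
import numpy as np


def kde_density(samples, min_range=-30, max_range=20, step=0.01, bandwidth=None):
    """Evaluate a Gaussian KDE of `samples` on a regular support grid."""
    samples = np.asarray(samples, dtype=float).ravel()
    support = np.arange(min_range, max_range, step)
    if bandwidth is None:
        # Normal-reference rule of thumb (assumed here, not taken from the package).
        bandwidth = 1.06 * samples.std(ddof=1) * samples.size ** (-1 / 5)
    # One Gaussian bump per sample, evaluated at every support point.
    z = (support[:, None] - samples[None, :]) / bandwidth
    density = np.exp(-0.5 * z**2).sum(axis=1) / (
        samples.size * bandwidth * np.sqrt(2 * np.pi)
    )
    return density, support


rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=200)
density, support = kde_density(samples)

# A density should integrate to roughly 1 over the support.
area = density.sum() * 0.01
```

Evaluating the true and predicted densities on the same support grid is what makes them directly comparable bin by bin.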

ise.utils.functions.get_X_y(data, dataset_type='sectors', return_format=None, cols='all', with_chars=True)

Extracts input features (X) and target labels (y) from a dataset.

Supports various dataset types (sectors, regions, scenarios) and formats (numpy, tensor, pandas).

Parameters:

data (str or pd.DataFrame) – Filepath to the dataset CSV or a pandas DataFrame.
dataset_type (str, optional) – The type of dataset (‘sectors’, ‘regions’, ‘scenarios’). Defaults to “sectors”.
return_format (str, optional) – Format of the returned data (‘numpy’, ‘tensor’, or ‘pandas’). Defaults to None.
cols (str or list, optional) – Columns to include in the features. Defaults to “all”.
with_chars (bool, optional) – Whether to include characteristic columns in features. Defaults to True.

Returns:

A tuple containing:

X (pd.DataFrame or np.ndarray or torch.Tensor): The input features.
y (pd.DataFrame or np.ndarray or torch.Tensor): The target labels.
scenarios (list, optional): Scenario identifiers if dataset type is “regions”.

Return type:

tuple
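
The feature/target split can be pictured with a small pandas sketch. It does not reproduce get_X_y: the function name split_features_target, the target column name sle, and the feature columns in the toy frame are hypothetical, and only the three metadata names mentioned in this documentation (exp, model, aogcm) are used.

```python
import pandas as pd

# Metadata names mentioned in this documentation; the real list is longer.
METADATA_COLS = ["exp", "model", "aogcm"]
TARGET_COL = "sle"  # hypothetical target column name


def split_features_target(df, return_format="pandas"):
    """Drop metadata columns and separate the target from the features."""
    to_drop = [c for c in METADATA_COLS + [TARGET_COL] if c in df.columns]
    X = df.drop(columns=to_drop)
    y = df[TARGET_COL]
    if return_format == "numpy":
        return X.to_numpy(), y.to_numpy()
    return X, y


toy = pd.DataFrame(
    {
        "exp": ["exp05", "exp05"],
        "model": ["m1", "m1"],
        "aogcm": ["a1", "a1"],
        "year": [0.0, 0.5],
        "temperature": [0.1, 0.3],
        "sle": [0.01, 0.02],
    }
)

X, y = split_features_target(toy)
X_np, y_np = split_features_target(toy, return_format="numpy")
```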

ise.utils.functions.get_all_filepaths(path, filetype=None, contains=None, not_contains=None)

Retrieves all filepaths for files within a directory. Supports subsetting based on filetype and substring search.

Parameters:

path (str) – Path to directory to be searched.
filetype (str, optional) – File type to be returned (e.g. csv, nc). Defaults to None.
contains (str, optional) – Substring that files found must contain. Defaults to None.
not_contains (str, optional) – Substring that files found must NOT contain. Defaults to None.

Returns:

list of files within the directory matching the input criteria.

Return type:

List[str]
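
The three filters can be illustrated with os.walk. This is a sketch rather than the package's code: the name find_files is made up for the example, and it assumes the substring filters apply to the file name only and that results are returned sorted.

```python
import os
import tempfile


def find_files(path, filetype=None, contains=None, not_contains=None):
    """Recursively collect file paths under `path` that pass all filters."""
    matches = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            if filetype is not None and not name.endswith("." + filetype.lstrip(".")):
                continue
            if contains is not None and contains not in name:
                continue
            if not_contains is not None and not_contains in name:
                continue
            matches.append(os.path.join(root, name))
    return sorted(matches)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a_forcing.nc", "b_forcing.csv", os.path.join("sub", "c_other.nc")):
        open(os.path.join(tmp, rel), "w").close()

    nc_files = [os.path.relpath(p, tmp) for p in find_files(tmp, filetype="nc")]
    forcing_nc = [
        os.path.relpath(p, tmp) for p in find_files(tmp, "nc", contains="forcing")
    ]
    no_forcing = [
        os.path.relpath(p, tmp) for p in find_files(tmp, not_contains="forcing")
    ]
```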

ise.utils.functions.get_data(data_dir, dataset_type='sectors', return_format='tensor')

Loads training, validation, and test datasets, formatting them for model training.

Parameters:

data_dir (str) – Path to the directory containing the dataset files.
dataset_type (str, optional) – Type of dataset (‘sectors’ or ‘scenarios’). Defaults to ‘sectors’.
return_format (str, optional) – Format of the returned data (‘tensor’, ‘numpy’, or ‘pandas’). Defaults to ‘tensor’.

Returns:

A tuple containing:

X_train (pd.DataFrame, np.ndarray, or torch.Tensor): Training features.
y_train (pd.DataFrame, np.ndarray, or torch.Tensor): Training labels.
X_val (pd.DataFrame, np.ndarray, or torch.Tensor): Validation features.
y_val (pd.DataFrame, np.ndarray, or torch.Tensor): Validation labels.
X_test (pd.DataFrame, np.ndarray, or torch.Tensor): Testing features.
y_test (pd.DataFrame, np.ndarray, or torch.Tensor): Testing labels.

Return type:

tuple

ise.utils.functions.get_device() → str

Returns ‘cuda’ if a usable CUDA device is available and ‘cpu’ otherwise, suppressing the PyTorch CUDA initialization warning that appears on broken drivers.
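
One way to get this behaviour is to wrap the availability check in warnings.catch_warnings. The sketch below is an assumption about the approach, not the package's code; the name pick_device and the fallback to ‘cpu’ when PyTorch is not installed are additions for the example.

```python
import warnings


def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # Fallback added for this sketch so it runs without PyTorch installed.
        return "cpu"
    with warnings.catch_warnings():
        # Silence the warning PyTorch may emit when CUDA fails to initialize.
        warnings.simplefilter("ignore")
        return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
```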

ise.utils.functions.get_uncertainty_bands(data: DataFrame, confidence: str = '95', quantiles: list[float] = [0.05, 0.95])

Computes uncertainty bands using confidence intervals and quantiles.

Parameters:

data (pd.DataFrame) – Data matrix of shape (N, M), where N is samples and M is time steps.
confidence (str, optional) – Confidence level (‘95’ or ‘99’). Defaults to “95”.
quantiles (List[float], optional) – Quantiles for uncertainty bands. Defaults to [0.05, 0.95].

Returns:

A tuple containing:

mean (np.ndarray): Mean values.
sd (np.ndarray): Standard deviation values.
upper_ci (np.ndarray): Upper confidence interval.
lower_ci (np.ndarray): Lower confidence interval.
upper_q (np.ndarray): Upper quantile bound.
lower_q (np.ndarray): Lower quantile bound.

Return type:

tuple
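
The six returned arrays can be sketched with NumPy as follows. Several details are assumptions made for illustration: the name uncertainty_bands, the z-scores 1.96 and 2.58, and the interval form mean ± z * sd (the package may instead use the standard error or another definition).

```python
import numpy as np

Z_SCORES = {"95": 1.96, "99": 2.58}  # assumed z-scores for the two levels


def uncertainty_bands(data, confidence="95", quantiles=(0.05, 0.95)):
    """Summarize an (N samples x M time steps) ensemble along the sample axis."""
    data = np.asarray(data, dtype=float)
    z = Z_SCORES[confidence]
    mean = data.mean(axis=0)
    sd = data.std(axis=0)
    upper_ci = mean + z * sd
    lower_ci = mean - z * sd
    lower_q, upper_q = np.quantile(data, quantiles, axis=0)
    return mean, sd, upper_ci, lower_ci, upper_q, lower_q


# Three ensemble members, two time steps; the second step has no spread.
ensemble = np.array([[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]])
mean, sd, upper_ci, lower_ci, upper_q, lower_q = uncertainty_bands(ensemble)
```

Where the ensemble members agree exactly, as in the second time step, both the interval and the quantile band collapse onto the mean.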

ise.utils.functions.group_by_run(dataset: DataFrame, column: str | None = None, condition: str | None = None)

Groups dataset simulations into structured matrices for true and predicted values.

Parameters:

dataset (pd.DataFrame) – Dataset containing simulation results.
column (str, optional) – Column name to subset on. Defaults to None.
condition (str, optional) – Condition for filtering the dataset. Defaults to None.

Returns:

A tuple containing:

all_trues (np.ndarray): Matrix of true values (N x M, where N is the number of simulations and M is time steps).
all_preds (np.ndarray): Matrix of predicted values.
scenarios (list): List of scenario information for each simulation.

Return type:

tuple

ise.utils.functions.load_ml_data(data_directory: str, time_series: bool = True)

Loads machine learning training and testing data from CSV files.

Parameters:

data_directory (str) – Directory containing the processed data files.
time_series (bool, optional) – Whether to load the time-series version of the data. Defaults to True.

Returns:

A tuple containing:

train_features (pd.DataFrame): Training feature set.
train_labels (pd.Series): Training labels.
test_features (pd.DataFrame): Testing feature set.
test_labels (pd.Series): Testing labels.
test_scenarios (list): List of test scenarios.

Return type:

tuple

ise.utils.functions.load_model(model_path, model_class, architecture, mc_dropout=False, dropout_prob=0.1)

Loads a PyTorch model from a saved state_dict file.

Parameters:

model_path (str) – Path to the model’s state dictionary file.
model_class (type) – Class reference of the model to be loaded.
architecture (dict) – Dictionary specifying the architecture of the model.
mc_dropout (bool, optional) – Whether the model uses Monte Carlo Dropout. Defaults to False.
dropout_prob (float, optional) – Dropout probability if MC Dropout is used. Defaults to 0.1.

Returns:

The loaded PyTorch model set to the available device (CPU/GPU).

Return type:

torch.nn.Module

ise.utils.functions.to_tensor(x)

Converts input data into a PyTorch tensor with float32 dtype.

Parameters:

x (pd.DataFrame, np.ndarray, or torch.Tensor) – Input data.

Returns:

Converted tensor.

Return type:

torch.Tensor

Raises:

ValueError – If the input data type is not supported.

ise.utils.functions.undummify(df: DataFrame, prefix_sep: str = '-')

Converts a one-hot encoded dataframe back to its categorical form.

Parameters:

df (pd.DataFrame) – DataFrame containing one-hot encoded categorical columns.
prefix_sep (str, optional) – Separator used in column names to identify categories. Defaults to “-”.

Returns:

DataFrame with categorical values restored.

Return type:

pd.DataFrame
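
A round trip through pd.get_dummies shows what reversing the encoding means. The function below is a sketch with a made-up name (reverse_dummies); it assumes every column containing the separator is a dummy column and that each row has exactly one active category per prefix.

```python
import pandas as pd


def reverse_dummies(df, prefix_sep="-"):
    """Collapse one-hot columns named '<prefix><sep><category>' into one column."""
    groups = {}
    for col in df.columns:
        if prefix_sep in col:
            prefix = col.split(prefix_sep, 1)[0]
            groups.setdefault(prefix, []).append(col)

    # Keep non-dummy columns as they are.
    out = df.drop(columns=[c for cols in groups.values() for c in cols])
    for prefix, cols in groups.items():
        # The active dummy column per row names the category.
        winners = df[cols].astype(int).idxmax(axis=1)
        out[prefix] = winners.str.split(prefix_sep, n=1).str[1]
    return out


raw = pd.DataFrame({"year": [2015, 2016, 2017], "model": ["A", "B", "A"]})
encoded = pd.get_dummies(raw, columns=["model"], prefix_sep="-")
restored = reverse_dummies(encoded)
```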

ise.utils.functions.unscale_column(dataset: DataFrame, column: str | list[str] = 'year', ice_sheet: str = 'AIS')

Unscales specified columns back to their original range using known value distributions.

This function is specifically used to revert the normalization of ‘year’ and ‘sectors’ columns since they have known value ranges.

Parameters:

dataset (pd.DataFrame) – Dataset containing the scaled columns.
column (str or list, optional) – Column(s) to be unscaled. Can be ‘year’, ‘sectors’, or a list containing both. Defaults to “year”.
ice_sheet (str, optional) – ‘AIS’ (18 sectors) or ‘GrIS’ (6 basins). Only relevant when ‘sectors’ is in column. Defaults to “AIS”.

Returns:

Dataset with the specified column(s) unscaled.

Return type:

pd.DataFrame
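
Because the original ranges are known, the inverse needs no saved scaler: it is plain arithmetic. The sketch below assumes the columns were MinMax-scaled to [0, 1]; the helper names known_range and unscale_value are hypothetical.

```python
def known_range(column, ice_sheet="AIS"):
    """Original (min, max) for the columns with fixed, known ranges."""
    if column == "year":
        return 2015, 2100
    if column == "sectors":
        return (1, 18) if ice_sheet == "AIS" else (1, 6)
    raise ValueError(f"no known range for column {column!r}")


def unscale_value(scaled, column, ice_sheet="AIS"):
    """Invert MinMax scaling, assuming the scaled range was [0, 1]."""
    low, high = known_range(column, ice_sheet)
    return scaled * (high - low) + low


first_year = unscale_value(0.0, "year")
last_year = unscale_value(1.0, "year")
mid_sector_ais = unscale_value(0.5, "sectors", "AIS")
last_basin_gris = unscale_value(1.0, "sectors", "GrIS")
```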

ise.utils.functions.unscale_input(X, scaler_path)

Inverse-transform input features using a saved sklearn scaler (e.g. scaler_X.pkl).

Only the columns listed in the scaler’s feature_names_in_ are inverse-transformed; other columns pass through unchanged. Works with any sklearn scaler that exposes get_feature_names_out (i.e. fitted on a DataFrame).

Parameters:

X (pd.DataFrame) – Scaled input features.
scaler_path (str) – Path to a pickled sklearn scaler.

Returns:

The inverse-transformed features in the original column order.

Return type:

pd.DataFrame

ise.utils.functions.unscale_output(y, scaler_path)

Inverse-transform a target array using a saved sklearn scaler (e.g. scaler_y.pkl).

Works with any sklearn scaler that implements inverse_transform (StandardScaler, MinMaxScaler, RobustScaler, etc.).

Parameters:

y (np.ndarray) – Scaled target values.
scaler_path (str) – Path to a pickled sklearn scaler.

Returns:

The inverse-transformed target values.

Return type:

np.ndarray

ise.utils.io

Runtime type-checking utilities for the ise package.

This module provides check_type, a thin wrapper around isinstance that raises a descriptive TypeError when an argument does not match the expected type. It is used at the boundaries of public-facing functions to give clear error messages rather than cryptic internal AttributeErrors:

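import pandas as pd
import xarray as xr

from ise.utils.io import check_type

def process(data, grid):
    # Fail fast with a descriptive TypeError at the public boundary.
    check_type(data, pd.DataFrame)
    check_type(grid, (str, xr.Dataset))
    ...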

ise.utils.io.check_type(obj, types)

Validate that an object is an instance of the given type(s).

Parameters:

obj – Object to check.
types – Expected type or tuple of types.

Returns:

1 if validation passes (for use in conditional logic).

Return type:

int

Raises:

TypeError – If obj is not an instance of types.
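
The behaviour documented above (raise TypeError on mismatch, return 1 on success) can be sketched as below. The name ensure_type and the message format are assumptions for illustration only.

```python
def ensure_type(obj, types):
    """Return 1 if `obj` is an instance of `types`, else raise TypeError."""
    if not isinstance(obj, types):
        raise TypeError(
            f"expected an instance of {types!r}, received {type(obj).__name__}"
        )
    return 1


# A single type or a tuple of types both work, as with isinstance.
ok_single = ensure_type("grid.nc", str)
ok_tuple = ensure_type(3.5, (int, float))

try:
    ensure_type([1, 2, 3], (str, dict))
except TypeError as err:
    message = str(err)
```

Returning a truthy value lets the check sit inside a conditional, while the exception path gives callers a clear message at the function boundary.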

Module contents

Utilities for I/O, data loading, and package paths.

This package provides functions for loading/saving data, path helpers (e.g. ismip6_model_configs_path), and tensor/data transformations.

The following functions are also importable directly from ise.utils; their parameters and return values are documented under ise.utils.functions above.

ise.utils.get_X_y(data, dataset_type='sectors', return_format=None, cols='all', with_chars=True)

ise.utils.get_data(data_dir, dataset_type='sectors', return_format='tensor')

ise.utils.get_device() → str

ise.utils.unscale_output(y, scaler_path)