Main functionality
We perform a one-tailed bootstrap test that compares two samples using pre-defined or user-defined statistics: the user provides a function that computes the statistic of interest, together with the predictions of two models on a test set.
- stambo.compare_models(y_test, preds_1, preds_2, metrics, groups=None, alpha=0.05, n_bootstrap=5000, seed=None, silent=False)
Compares predictions from two models \(f_1(x)\) and \(f_2(x)\), which yield prediction vectors \(\hat y_{1}\) and \(\hat y_{2}\), with a one-tailed bootstrap hypothesis test. Note that every metric must be defined so that more is better (see the remark on less-is-better metrics below).
That is, we test the following null and alternative hypotheses:
\[H_0: M(y_{gt}, \hat y_{2}) \leq M(y_{gt}, \hat y_{1}), \qquad H_1: M(y_{gt}, \hat y_{2}) > M(y_{gt}, \hat y_{1}),\]
where \(M\) is a metric, \(y_{gt}\) is the vector of ground-truth labels, and \(\hat y_{i}, i=1,2\) are the vectors of predictions of models 1 and 2, respectively. This test is performed for every specified metric.
By default, the function assumes that the metrics are defined as more is better (e.g. accuracy or AUC). If you work with metrics defined as less is better, simply swap the models (\(\hat y_{1}\) and \(\hat y_{2}\)) in the function call.
While the test does return a \(p\)-value, one should be careful with its interpretation: the \(p\)-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming that \(H_0\) is true. It is not the probability that \(H_0\) holds. With large data, even small effects can be statistically significant, so one should also consider the effect size.
We compute a standardized effect size using the estimated bootstrap variance.
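For intuition, a sketch of what such a quantity looks like (assuming the difference is taken as model 2 minus model 1, in line with the alternative hypothesis above; the exact estimator is defined in the source):
\[d = \frac{M(y_{gt}, \hat y_{2}) - M(y_{gt}, \hat y_{1})}{\hat\sigma_{boot}},\]
where \(\hat\sigma_{boot}\) is the standard deviation of the metric difference across bootstrap resamples.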
Beyond the hypothesis testing, the function also returns confidence intervals per metric, i.e.
\[P\left(M(y_{gt,*}, \hat y_*) \in [L_{CI}(\alpha), H_{CI}(\alpha)]\right) = 1 - \alpha,\]
where \(L_{CI}\) and \(H_{CI}\) are the lower and upper bounds of the confidence interval, respectively, \(\alpha\) is the significance level, and \(*\) indicates that the metric is computed on the underlying population rather than the observed sample.
At the moment, the confidence intervals are computed using the simple percentile method. In the future, we will implement the more accurate BCa approach.
- Parameters:
y_test (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]]) – Ground truth.
preds_1 (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]]) – Prediction from model 1.
preds_2 (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]]) – Prediction from model 2.
metrics (Tuple[str | Metric]) – A set of metrics to evaluate. The user either specifies metrics available in the stambo library (stambo.metrics), or passes an instance of a custom-defined metric.
groups (ndarray[tuple[Any, ...], dtype[int]] | None) – Groups indicating the subject for each measurement. Defaults to None.
alpha (float) – A significance level for confidence intervals (from 0 to 1). Defaults to 0.05.
n_bootstrap (int) – The number of bootstrap iterations. Defaults to 5000.
seed (int | None) – Random seed. Defaults to None.
silent (bool) – Whether to execute the function silently, i.e. not showing the progress bar. Defaults to False.
- Returns:
A dictionary mapping each metric name to a tuple of results. Each tuple contains, in order:
One-sided \(p\)-value
Observed difference (effect size)
Effect size CI low
Effect size CI high
\(M(y_{gt}, \hat y_{1})\)
\(M(y_{gt}, \hat y_{1})_{(\alpha / 2)}\)
\(M(y_{gt}, \hat y_{1})_{(1 - \alpha / 2)}\)
\(M(y_{gt}, \hat y_{2})\)
\(M(y_{gt}, \hat y_{2})_{(\alpha / 2)}\)
\(M(y_{gt}, \hat y_{2})_{(1 - \alpha / 2)}\)
- Return type:
Dict[str, Tuple[float]]
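A minimal usage sketch on synthetic data. Here, "ROCAUC" is assumed to be one of the metric names shipped in stambo.metrics; substitute any metric available in your version:

```python
import numpy as np
import stambo

# Synthetic binary classification data: ground truth and probabilistic
# predictions of two hypothetical models on the same test set.
rng = np.random.default_rng(42)
y_test = rng.integers(0, 2, size=200)
preds_1 = np.clip(0.5 * y_test + rng.normal(0.25, 0.2, size=200), 0, 1)
preds_2 = np.clip(0.6 * y_test + rng.normal(0.20, 0.2, size=200), 0, 1)

# "ROCAUC" is an assumed metric name; check stambo.metrics for what exists.
report = stambo.compare_models(
    y_test, preds_1, preds_2,
    metrics=("ROCAUC",),
    seed=42,
)

# Each entry is a tuple laid out as documented above;
# the one-sided p-value comes first.
print(f"One-sided p-value (ROCAUC): {report['ROCAUC'][0]:.4f}")
```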
- stambo.two_sample_test(sample_1, sample_2, statistics, groups=None, alpha=0.05, n_bootstrap=5000, seed=None, non_paired=False, silent=False)
Tests whether the empirical difference between statistics computed on two samples is statistically significant.
The hypotheses we test are:
\[H_0: f(x_1) \leq f(x_2), \qquad H_1: f(x_1) > f(x_2),\]
where \(f\) is a function of interest, and \(x_1\) and \(x_2\) are the samples to be compared. Note that the statistics are computed and tested independently of one another.
- Parameters:
sample_1 (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]] | PredSampleWrapper) – Sample 1 to be compared.
sample_2 (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]] | PredSampleWrapper) – Sample 2 to be compared.
statistics (Dict[str, Callable]) – Statistics to compare the samples by.
groups (ndarray[tuple[Any, ...], dtype[int]] | None) – Groups indicating the subject for each measurement. Defaults to None.
alpha (float) – A significance level for confidence intervals (from 0 to 1). Defaults to 0.05.
n_bootstrap (int) – The number of bootstrap iterations. Defaults to 5000.
seed (int | None) – Random seed. Defaults to None.
non_paired (bool) – Whether to use a non-paired (independent samples) design. Defaults to False.
silent (bool) – Whether to execute the function silently, i.e. not showing the progress bar. Defaults to False.
- Returns:
A dictionary mapping each statistic name to a tuple of results. Each tuple contains, in order:
Right-tailed \(p\)-value
Observed difference (effect size)
CI low (effect size)
CI high (effect size)
Empirical value (sample 1)
CI low (sample 1)
CI high (sample 1)
Empirical value (sample 2)
CI low (sample 2)
CI high (sample 2)
- Return type:
Dict[str, Tuple[float]]
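A usage sketch with plain NumPy statistics (the samples below are synthetic, paired measurements):

```python
import numpy as np
import stambo

# Synthetic paired measurements, e.g. per-case scores of two pipelines.
rng = np.random.default_rng(0)
sample_1 = rng.normal(1.2, 1.0, size=150)
sample_2 = sample_1 - rng.normal(0.1, 0.5, size=150)

# `statistics` maps a display name to any callable that reduces
# a sample to a scalar.
report = stambo.two_sample_test(
    sample_1, sample_2,
    statistics={"mean": np.mean, "median": np.median},
    seed=0,
)

# The first element of each tuple is the right-tailed p-value.
print(report["mean"][0], report["median"][0])
```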
- stambo.to_latex(report, m1_name='M1', m2_name='M2', n_digits=2)
Converts a report returned by StamBO into a LaTeX table for convenient viewing.
Note: The alternative hypothesis is that the second model outperforms the first; the reported \(p\)-value is the one-tailed \(p\)-value described above.
- Parameters:
report (Dict[str, Tuple[float]]) – Dictionary with metrics in the StamBO-generated format.
m1_name (str | None) – Name to assign to the first model row. Defaults to M1.
m2_name (str | None) – Name to assign to the second model row. Defaults to M2.
n_digits (int) – Number of digits to round to. Defaults to 2.
- Returns:
A ready-to-paste LaTeX table in the tabular environment.
- Return type:
str
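A brief sketch of rendering a report. The dictionary below is hand-crafted in the documented Dict[str, Tuple[float]] format with made-up numbers, purely for illustration; in practice you would pass the output of compare_models or two_sample_test:

```python
import stambo

# Hypothetical report with the ten documented fields per metric:
# (p-value, effect size, effect CI low/high, M1 value, M1 CI low/high,
#  M2 value, M2 CI low/high). Numbers are made up for illustration.
report = {"ROCAUC": (0.02, 0.03, 0.01, 0.05, 0.90, 0.87, 0.93, 0.93, 0.90, 0.96)}

# Paste the printed tabular environment into a LaTeX document.
print(stambo.to_latex(report, m1_name="Baseline", m2_name="Ours", n_digits=3))
```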
- class stambo.PredSampleWrapper(predictions, gt, multiclass=True, threshold=0.5, cached_am=None)
Bases: object
Wraps predictions and targets in one object.
- Parameters:
predictions (ndarray[tuple[Any, ...], dtype[float | int]]) – Model predictions to wrap.
gt (ndarray[tuple[Any, ...], dtype[float | int]]) – Ground-truth labels.
multiclass (bool) – Whether the predictions correspond to a multiclass classifier. Defaults to True.
threshold (float | None) – Threshold to apply to binary predictions when multiclass is False. Defaults to 0.5.
cached_am (ndarray[tuple[Any, ...], dtype[int]] | None) – Optional cached argmax / thresholded predictions to reuse.
- __getitem__(idx)
Gives access to the predictions and the ground truth by a single index or a collection of indices.
- Parameters:
idx (int | Iterable[int] | ndarray[tuple[Any, ...], dtype[int]]) – Single index or collection of indices.
- Returns:
Either a tuple containing the predictions, argmaxed predictions, and ground truth for a single index, or a new PredSampleWrapper restricted to the provided indices.
- Return type:
Tuple | PredSampleWrapper
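A brief sketch of wrapping multiclass predictions (all data synthetic):

```python
import numpy as np
from stambo import PredSampleWrapper

# Hypothetical softmax outputs of a 3-class model for five samples.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])
gt = np.array([0, 1, 2, 1, 2])

sample = PredSampleWrapper(probs, gt, multiclass=True)

# A single index yields (predictions, argmaxed predictions, ground truth).
preds_0, am_0, gt_0 = sample[0]

# A collection of indices yields a new wrapper restricted to that subset,
# which is handy for bootstrap-style resampling.
subsample = sample[np.array([0, 2, 4])]
```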