Main functionality
We perform a one-tailed bootstrap test that compares two samples using pre-defined or user-defined statistics: the user provides a function that computes the statistic of interest, together with the predictions of two models on a test set.
- stambo.compare_models(y_test, preds_1, preds_2, metrics, groups=None, alpha=0.05, n_bootstrap=5000, seed=None, silent=False)
Compares predictions from two models \(f_1(x)\) and \(f_2(x)\), which yield prediction vectors \(\hat y_{1}\) and \(\hat y_{2}\), with a one-tailed bootstrap hypothesis test. Note that every metric must be defined so that more is better (see the remark on less-is-better metrics below).
That is, we test the following null and alternative hypotheses:
\[H_0: M(y_{gt}, \hat y_{2}) \leq M(y_{gt}, \hat y_{1}), \qquad H_1: M(y_{gt}, \hat y_{2}) > M(y_{gt}, \hat y_{1}),\]
where \(M\) is a metric, \(y_{gt}\) is the vector of ground-truth labels, and \(\hat y_{i}, i=1,2\) are the vectors of predictions of models 1 and 2, respectively. This test is performed for every specified metric.
By default, the function assumes that the metrics are defined as more is better (e.g. accuracy or AUC). If you work with metrics defined as less is better, simply swap the models (\(\hat y_{1}\) and \(\hat y_{2}\)) in the function call.
While the test does return a \(p\)-value, one should be careful with its interpretation: the \(p\)-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming that \(H_0\) is true. It is not the probability that \(H_0\) holds. With large data, even small effects can be statistically significant, so one should also consider the effect size.
We compute a standardized effect size using the estimated bootstrap variance.
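For intuition, a sketch of what such a quantity looks like (assuming the difference is taken as model 2 minus model 1, in line with the alternative hypothesis above; the exact estimator is defined in the source):
\[d = \frac{M(y_{gt}, \hat y_{2}) - M(y_{gt}, \hat y_{1})}{\hat\sigma_{boot}},\]
where \(\hat\sigma_{boot}\) is the standard deviation of the metric difference across bootstrap resamples.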
Beyond the hypothesis testing, the function also returns confidence intervals per metric, i.e.
\[P\left(M(y_{gt,*}, \hat y_*) \in [L_{CI}(\alpha), H_{CI}(\alpha)]\right) = 1 - \alpha,\]
where \(L_{CI}\) and \(H_{CI}\) are the lower and upper bounds of the confidence interval, respectively, \(\alpha\) is the significance level, and \(*\) indicates that the metric is computed on the underlying population rather than the observed sample.
At the moment, the confidence intervals are computed using the simple percentile method. In the future, we will implement the more accurate BCa approach.
- Parameters:
y_test (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]]) – Ground truth.
preds_1 (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]]) – Prediction from model 1.
preds_2 (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]]) – Prediction from model 2.
metrics (Tuple[str | Metric]) – A set of metrics to evaluate. The user either specifies metrics available in the stambo library (stambo.metrics), or passes an instance of a custom-defined metric.
groups (ndarray[tuple[Any, ...], dtype[int]] | None) – Groups indicating the subject for each measurement. Defaults to None.
alpha (float) – A significance level for confidence intervals (from 0 to 1). Defaults to 0.05.
n_bootstrap (int) – The number of bootstrap iterations. Defaults to 5000.
seed (int | None) – Random seed. Defaults to None.
silent (bool) – Whether to execute the function silently, i.e. not showing the progress bar. Defaults to False.
- Returns:
A dictionary mapping each metric name to a tuple of results. Each tuple contains, in order:
One-sided \(p\)-value
Observed difference (effect size)
Effect size CI low
Effect size CI high
\(M(y_{gt}, \hat y_{1})\)
\(M(y_{gt}, \hat y_{1})_{(\alpha / 2)}\)
\(M(y_{gt}, \hat y_{1})_{(1 - \alpha / 2)}\)
\(M(y_{gt}, \hat y_{2})\)
\(M(y_{gt}, \hat y_{2})_{(\alpha / 2)}\)
\(M(y_{gt}, \hat y_{2})_{(1 - \alpha / 2)}\)
- Return type:
Dict[str, Tuple[float]]
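A minimal usage sketch on synthetic data. Here, "ROCAUC" is assumed to be one of the metric names shipped in stambo.metrics; substitute any metric available in your version:

```python
import numpy as np
import stambo

# Synthetic binary classification data: ground truth and probabilistic
# predictions of two hypothetical models on the same test set.
rng = np.random.default_rng(42)
y_test = rng.integers(0, 2, size=200)
preds_1 = np.clip(0.5 * y_test + rng.normal(0.25, 0.2, size=200), 0, 1)
preds_2 = np.clip(0.6 * y_test + rng.normal(0.20, 0.2, size=200), 0, 1)

# "ROCAUC" is an assumed metric name; check stambo.metrics for what exists.
report = stambo.compare_models(
    y_test, preds_1, preds_2,
    metrics=("ROCAUC",),
    seed=42,
)

# Each entry is a tuple laid out as documented above;
# the one-sided p-value comes first.
print(f"One-sided p-value (ROCAUC): {report['ROCAUC'][0]:.4f}")
```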
- stambo.two_sample_test(sample_1, sample_2, statistics, groups=None, alpha=0.05, n_bootstrap=5000, seed=None, non_paired=False, silent=False)
Tests whether the empirical difference between statistics computed on two samples is statistically significant.
The hypotheses we test are:
\[H_0: f(x_1) \leq f(x_2), \qquad H_1: f(x_1) > f(x_2),\]
where \(f\) is a function of interest, and \(x_1\) and \(x_2\) are the samples to be compared. Note that the statistics are computed and tested independently of one another.
- Parameters:
sample_1 (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]] | PredSampleWrapper) – Sample 1 to be compared.
sample_2 (ndarray[tuple[Any, ...], dtype[int]] | ndarray[tuple[Any, ...], dtype[float]] | PredSampleWrapper) – Sample 2 to be compared.
statistics (Dict[str, Callable]) – Statistics to compare the samples by.
groups (ndarray[tuple[Any, ...], dtype[int]] | None) – Groups indicating the subject for each measurement. Defaults to None.
alpha (float) – A significance level for confidence intervals (from 0 to 1). Defaults to 0.05.
n_bootstrap (int) – The number of bootstrap iterations. Defaults to 5000.
seed (int | None) – Random seed. Defaults to None.
non_paired (bool) – Whether to use a non-paired (independent samples) design. Defaults to False.
silent (bool) – Whether to execute the function silently, i.e. not showing the progress bar. Defaults to False.
- Returns:
A dictionary mapping each statistic name to a tuple of results. Each tuple contains, in order:
Right-tailed \(p\)-value
Observed difference (effect size)
CI low (effect size)
CI high (effect size)
Empirical value (sample 1)
CI low (sample 1)
CI high (sample 1)
Empirical value (sample 2)
CI low (sample 2)
CI high (sample 2)
- Return type:
Dict[str, Tuple[float]]
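A usage sketch with plain NumPy statistics (the samples below are synthetic, paired measurements):

```python
import numpy as np
import stambo

# Synthetic paired measurements, e.g. per-case scores of two pipelines.
rng = np.random.default_rng(0)
sample_1 = rng.normal(1.2, 1.0, size=150)
sample_2 = sample_1 - rng.normal(0.1, 0.5, size=150)

# `statistics` maps a display name to any callable that reduces
# a sample to a scalar.
report = stambo.two_sample_test(
    sample_1, sample_2,
    statistics={"mean": np.mean, "median": np.median},
    seed=0,
)

# The first element of each tuple is the right-tailed p-value.
print(report["mean"][0], report["median"][0])
```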
- stambo.to_latex(report, m1_name='M1', m2_name='M2', n_digits=2)
Converts a report returned by StamBO into a LaTeX table for convenient viewing.
Note: The alternative hypothesis is that the second model outperforms the first; the reported \(p\)-value is the one-tailed \(p\)-value described above.
- Parameters:
report (Dict[str, Tuple[float]]) – Dictionary with metrics in the StamBO-generated format.
m1_name (str | None) – Name to assign to the first model row. Defaults to M1.
m2_name (str | None) – Name to assign to the second model row. Defaults to M2.
n_digits (int) – Number of digits to round to. Defaults to 2.
- Returns:
A ready-to-paste LaTeX table in the tabular environment.
- Return type:
str
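A brief sketch of rendering a report. The dictionary below is hand-crafted in the documented Dict[str, Tuple[float]] format with made-up numbers, purely for illustration; in practice you would pass the output of compare_models or two_sample_test:

```python
import stambo

# Hypothetical report with the ten documented fields per metric:
# (p-value, effect size, effect CI low/high, M1 value, M1 CI low/high,
#  M2 value, M2 CI low/high). Numbers are made up for illustration.
report = {"ROCAUC": (0.02, 0.03, 0.01, 0.05, 0.90, 0.87, 0.93, 0.93, 0.90, 0.96)}

# Paste the printed tabular environment into a LaTeX document.
print(stambo.to_latex(report, m1_name="Baseline", m2_name="Ours", n_digits=3))
```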
- class stambo.PredSampleWrapper(predictions, gt, multiclass=True, threshold=0.5, cached_am=None)
Bases: object
Wraps predictions and targets in one object.
- Parameters:
predictions (ndarray[tuple[Any, ...], dtype[float | int]]) – Model predictions to wrap.
gt (ndarray[tuple[Any, ...], dtype[float | int]]) – Ground-truth labels.
multiclass (bool) – Whether the predictions correspond to a multiclass classifier. Defaults to True.
threshold (float | None) – Threshold to apply to binary predictions when multiclass is False. Defaults to 0.5.
cached_am (ndarray[tuple[Any, ...], dtype[int]] | None) – Optional cached argmax / thresholded predictions to reuse.
- __getitem__(idx)
Gives access to the predictions and the ground truth by a single index or a collection of indices.
- Parameters:
idx (int | Iterable[int] | ndarray[tuple[Any, ...], dtype[int]]) – Single index or collection of indices.
- Returns:
Either a tuple containing the predictions, argmaxed predictions, and ground truth for a single index, or a new PredSampleWrapper restricted to the provided indices.
- Return type:
Tuple | PredSampleWrapper
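A brief sketch of wrapping multiclass predictions (all data synthetic):

```python
import numpy as np
from stambo import PredSampleWrapper

# Hypothetical softmax outputs of a 3-class model for five samples.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])
gt = np.array([0, 1, 2, 1, 2])

sample = PredSampleWrapper(probs, gt, multiclass=True)

# A single index yields (predictions, argmaxed predictions, ground truth).
preds_0, am_0, gt_0 = sample[0]

# A collection of indices yields a new wrapper restricted to that subset,
# which is handy for bootstrap-style resampling.
subsample = sample[np.array([0, 2, 4])]
```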