Comparing two classification models using `stambo`

V1.1.3: © Aleksei Tiulpin, PhD, 2025

This notebook shows an end-to-end example of how to take a dataset, train two machine learning models, and run a statistical test to assess whether their performance differs. We first use a set of classical metrics (essentially the metrics from sklearn). At the end of the tutorial, we show how to generate a LaTeX report and how to implement a custom metric.

Import of necessary libraries

[1]:

import stambo

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

SEED = 2025

stambo.__version__

[1]:

'0.1.4'

Loading the UCI breast cancer dataset and creating train-test split

[2]:

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=SEED, stratify=y)

scaler = StandardScaler()
scaler.fit(Xtr)

Xtr = scaler.transform(Xtr)
Xte = scaler.transform(Xte)
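
Note that the scaler is fit on the training split only, so no test-set statistics leak into the preprocessing. As a minimal standard-library sketch of what this amounts to for a single feature (the toy numbers and the helper name `standardize` are made up for illustration; `StandardScaler` itself operates on whole arrays):

```python
from statistics import fmean, pstdev


def standardize(values, mean, std):
    """Z-score `values` using a precomputed mean and standard deviation."""
    return [(v - mean) / std for v in values]


# Toy single-feature data (hypothetical numbers, not taken from the dataset).
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 11.0]

# The statistics come from the training split only.
mu = fmean(train)
sigma = pstdev(train)  # population std (ddof=0), which StandardScaler also uses

train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)  # reuses the training statistics
```

After this step the training feature has zero mean and unit variance, while the test feature generally does not, which is expected.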

Training the models

We train a kNN classifier and a logistic regression. Judging by the test-set AUC printed below, the logistic regression outperforms the kNN.

[3]:

model = KNeighborsClassifier(n_neighbors=3)
model.fit(Xtr, ytr)
preds_knn = model.predict_proba(Xte)[:, 1]

model = LogisticRegression(C=1e-2, random_state=42)
model.fit(Xtr, ytr)
preds_lr = model.predict_proba(Xte)[:, 1]


auc_knn, auc_lr = roc_auc_score(yte, preds_knn), roc_auc_score(yte, preds_lr)
print(f"kNN AUC: {auc_knn:.4f} / LR AUC: {auc_lr:.4f}")

kNN AUC: 0.9728 / LR AUC: 0.9888
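
To read these numbers: the ROC AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all pairs. It is only an illustration of the definition (the helper name `pairwise_auc` and the toy data are ours), not how `roc_auc_score` is implemented.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical labels and scores).
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, s_toy)  # 3 of the 4 pairs are ordered correctly
```

Ties matter in practice here: a 3-nearest-neighbour classifier with uniform weights can only output the probabilities 0, 1/3, 2/3, and 1, so many test cases share the same score.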

Statistical testing

As stated in the documentation, the testing routine returns a dict. The keys are the metric tags, and each value is an array that stores the results in the following order:

p-value (\(H_0: model_1 = model_2\))
Effect size (model 2 minus model 1)
CI low (effect size)
CI high (effect size)
Empirical value (model 1)
CI low (model 1)
CI high (model 1)
Empirical value (model 2)
CI low (model 2)
CI high (model 2)

If you launch the code in Binder, decrease the number of bootstrap iterations (10000 by default).

[4]:

testing_result = stambo.compare_models(yte, preds_knn, preds_lr, metrics=("ROCAUC", "AP", "QKappa", "BACC", "MCC"), seed=SEED, n_bootstrap=1000)

Bootstrapping: 100%|██████████| 1000/1000 [00:03<00:00, 279.93it/s]
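
To give an intuition for what the bootstrap iterations buy us, here is a minimal standard-library sketch of a paired bootstrap for the difference in accuracy between two sets of predictions. It is an illustration under our own assumptions (percentile intervals, accuracy as the metric, hypothetical helper names and toy data) and is not meant to reproduce how `stambo` computes its results.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap_diff(y_true, pred_a, pred_b, n_bootstrap=1000, seed=0, alpha=0.05):
    """Observed accuracy(pred_b) - accuracy(pred_a) and its percentile CI."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_bootstrap):
        # Resample cases with replacement; the same indices are used for both
        # models, which keeps the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(accuracy(yt, b) - accuracy(yt, a))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_bootstrap)]
    high = diffs[min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))]
    observed = accuracy(y_true, pred_b) - accuracy(y_true, pred_a)
    return observed, low, high


# Toy data (hypothetical): model A makes 4 errors, model B makes 2, out of 20.
y_demo = [0] * 10 + [1] * 10
pred_a = list(y_demo)
pred_b = list(y_demo)
for i in (0, 1, 10, 11):
    pred_a[i] = 1 - pred_a[i]
for i in (2, 12):
    pred_b[i] = 1 - pred_b[i]

observed, ci_low, ci_high = paired_bootstrap_diff(
    y_demo, pred_a, pred_b, n_bootstrap=500, seed=2025
)
```

Fewer iterations run faster but make the interval bounds noisier, which is the trade-off behind lowering `n_bootstrap` on a constrained machine.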

The testing results are available as a dict in the format described above:

[5]:

testing_result

[5]:

{'ROCAUC': array([0.02597403, 0.01607463, 0.00116143, 0.03622366, 0.97275219,
        0.94904937, 0.99252755, 0.98882682, 0.97478143, 0.99825874]),
 'AP': array([0.003996  , 0.02314431, 0.00502882, 0.04678529, 0.9689868 ,
        0.94193877, 0.99255722, 0.99213111, 0.9810929 , 0.99893794]),
 'QKappa': array([ 0.08591409, -0.0320894 , -0.07893126,  0.00764438,  0.90810898,
         0.85387269,  0.95589713,  0.87601958,  0.81299213,  0.93027112]),
 'BACC': array([ 0.06393606, -0.02079161, -0.04718104,  0.00252861,  0.94531991,
         0.91380672,  0.97414673,  0.9245283 ,  0.88886054,  0.95691282]),
 'MCC': array([ 0.19180819, -0.02795235, -0.07123844,  0.0096027 ,  0.91078326,
         0.86011542,  0.95680034,  0.88283091,  0.82759216,  0.93254094])}
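
Indexing into these arrays by position is error-prone, so it can help to give the entries names. In the arrays printed above, each entry holds ten numbers: the p-value, the effect size with its CI bounds, and then the empirical value with CI bounds for each model (the second number equals the difference between the two empirical values). The sketch below encodes that reading in a `NamedTuple`; the class name `MetricResult`, its field names, and the 0.05 threshold are our own choices for illustration and not part of `stambo`.

```python
from typing import NamedTuple


class MetricResult(NamedTuple):
    """Named view of one result array (field names are hypothetical)."""
    p_value: float
    effect_size: float
    effect_ci_low: float
    effect_ci_high: float
    m1_value: float
    m1_ci_low: float
    m1_ci_high: float
    m2_value: float
    m2_ci_low: float
    m2_ci_high: float


# Values copied from the 'ROCAUC' and 'MCC' entries printed above.
printed = {
    "ROCAUC": [0.02597403, 0.01607463, 0.00116143, 0.03622366, 0.97275219,
               0.94904937, 0.99252755, 0.98882682, 0.97478143, 0.99825874],
    "MCC": [0.19180819, -0.02795235, -0.07123844, 0.0096027, 0.91078326,
            0.86011542, 0.95680034, 0.88283091, 0.82759216, 0.93254094],
}

ALPHA = 0.05  # conventional significance level, chosen here for illustration

for tag, values in printed.items():
    res = MetricResult(*values)
    verdict = "significant" if res.p_value < ALPHA else "not significant"
    print(f"{tag}: effect {res.effect_size:+.3f} "
          f"[{res.effect_ci_low:+.3f}, {res.effect_ci_high:+.3f}], "
          f"p={res.p_value:.3f} ({verdict})")
```

The same unpacking works on the arrays in `testing_result` directly, for example `MetricResult(*testing_result["ROCAUC"])`.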

Most commonly, though, we want to present the results in a report, paper, or presentation. For that, we can use the function `to_latex` to get a cut-and-paste tabular. To use it in a LaTeX document, remember to load the `booktabs` package.

[6]:

print(stambo.to_latex(testing_result, m1_name="kNN", m2_name="LR"))

% \usepackage{booktabs} <-- do not forget to have this imported.
\begin{tabular}{llllll} \\
\toprule
\textbf{Model} & \textbf{ROCAUC} & \textbf{AP} & \textbf{QKappa} & \textbf{BACC} & \textbf{MCC} \\
\midrule
kNN & $0.97$ [$0.95$-$0.99$] & $0.97$ [$0.94$-$0.99$] & $0.91$ [$0.85$-$0.96$] & $0.95$ [$0.91$-$0.97$] & $0.91$ [$0.86$-$0.96$] \\
LR & $0.99$ [$0.97$-$1.00$] & $0.99$ [$0.98$-$1.00$] & $0.88$ [$0.81$-$0.93$] & $0.92$ [$0.89$-$0.96$] & $0.88$ [$0.83$-$0.93$] \\
\midrule
Effect size & $0.02$ [$0.00$-$0.04]$ & $0.02$ [$0.01$-$0.05]$ & $-0.03$ [$-0.08$-$0.01]$ & $-0.02$ [$-0.05$-$0.00]$ & $-0.03$ [$-0.07$-$0.01]$ \\
\midrule
$p$-value & $0.03$ & $0.00$ & $0.09$ & $0.06$ & $0.19$ \\
\bottomrule
\end{tabular}
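
To check quickly that such a table compiles, one option is to wrap the string in a minimal document that loads `booktabs` in the preamble. The sketch below does this in plain Python; the helper `wrap_in_document` and the short placeholder table are hypothetical and only stand in for the string returned by `stambo.to_latex`.

```python
def wrap_in_document(tabular: str) -> str:
    """Return a minimal standalone LaTeX document containing `tabular`."""
    return "\n".join([
        r"\documentclass{article}",
        r"\usepackage{booktabs}  % provides \toprule, \midrule, \bottomrule",
        r"\begin{document}",
        tabular.strip(),
        r"\end{document}",
        "",
    ])


# Placeholder table standing in for the output of stambo.to_latex(...).
demo_tabular = "\n".join([
    r"\begin{tabular}{ll}",
    r"\toprule",
    r"\textbf{Model} & \textbf{ROCAUC} \\",
    r"\midrule",
    r"kNN & $0.97$ \\",
    r"\bottomrule",
    r"\end{tabular}",
])

document = wrap_in_document(demo_tabular)
```

The resulting string can then be written to a `.tex` file and compiled with any LaTeX distribution that ships `booktabs`.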

Custom metrics

Sometimes the default metrics are not enough, and one may want to add further ones. Let us define an F2 score.

[7]:

from sklearn.metrics import fbeta_score
from functools import partial
from stambo.metrics import Metric

[8]:

class F2Score(Metric):
    def __init__(self) -> None:
        super().__init__(partial(fbeta_score, beta=2), int_input=True)

    def __str__(self) -> str:
        return "F2Score"
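
For reference, the F-beta score is a weighted harmonic mean of precision and recall in which recall counts beta times as much as precision, so F2 penalizes false negatives more than false positives. A minimal sketch of the formula from confusion counts (the helper `fbeta_from_counts` and the toy counts are ours, for illustration only):

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives and no positive predictions.
    return (1 + b2) * tp / denom if denom else 0.0


# Toy counts (hypothetical): 8 true positives, 2 false positives, 4 false negatives,
# i.e. precision 0.8 and recall of about 0.67.
f1 = fbeta_from_counts(8, 2, 4, beta=1.0)
f2 = fbeta_from_counts(8, 2, 4, beta=2.0)
```

Because recall is lower than precision in this toy case, the F2 value comes out below the F1 value.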

[9]:

testing_result = stambo.compare_models(yte, preds_knn, preds_lr,
                                       metrics=("ROCAUC", "AP", F2Score()), seed=SEED)

Bootstrapping: 100%|██████████| 10000/10000 [00:27<00:00, 369.94it/s]

[10]:

print(stambo.to_latex(testing_result, m1_name="kNN", m2_name="LR"))

% \usepackage{booktabs} <-- do not forget to have this imported.
\begin{tabular}{llll} \\
\toprule
\textbf{Model} & \textbf{ROCAUC} & \textbf{AP} & \textbf{F2Score} \\
\midrule
kNN & $0.97$ [$0.95$-$0.99$] & $0.97$ [$0.94$-$0.99$] & $0.98$ [$0.97$-$0.99$] \\
LR & $0.99$ [$0.98$-$1.00$] & $0.99$ [$0.98$-$1.00$] & $0.98$ [$0.97$-$0.99$] \\
\midrule
Effect size & $0.02$ [$0.00$-$0.03]$ & $0.02$ [$0.00$-$0.05]$ & $-0.00$ [$-0.01$-$0.01]$ \\
\midrule
$p$-value & $0.04$ & $0.01$ & $0.82$ \\
\bottomrule
\end{tabular}

[11]:

testing_result

[11]:

{'ROCAUC': array([3.73962604e-02, 1.60746284e-02, 6.65903429e-04, 3.46571565e-02,
        9.72752187e-01, 9.49497066e-01, 9.92339013e-01, 9.88826816e-01,
        9.75220895e-01, 9.97773542e-01]),
 'AP': array([0.00659934, 0.02314431, 0.00493627, 0.04595217, 0.9689868 ,
        0.94205306, 0.9918646 , 0.99213111, 0.9814529 , 0.99873236]),
 'F2Score': array([ 8.22917708e-01, -9.88531818e-04, -1.01762084e-02,  1.06433115e-02,
         9.83425414e-01,  9.70654628e-01,  9.93265993e-01,  9.82436883e-01,
         9.72877358e-01,  9.90415335e-01])}

