Comparing two samples using stambo


V1.1.4: © Aleksei Tiulpin, PhD, 2025

There are many cases where, when developing models other than classifiers or regressors, we compute a score per datapoint and then take the mean of those scores. In such cases we often simply want to compare two samples of measurements, and stambo makes this easy as well.

This example shows how to conduct a simple two-sample test. The example is synthetic: we generate two Gaussian samples and assess whether the mean of the second sample is greater than the mean of the first.
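In hypothesis-testing terms, this amounts to informally assessing

$H_0:\ \mu_2 = \mu_1$ versus $H_1:\ \mu_2 > \mu_1$,

where $\mu_1$ and $\mu_2$ denote the population means of the first and second sample. (This is just a restatement of the goal above; the exact test statistic and its bootstrap distribution are handled by stambo.)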

Importing the libraries

[1]:
import numpy as np
import stambo
import matplotlib.pyplot as plt

SEED=2025

stambo.__version__
[1]:
'0.1.4'

Data generation

[2]:
np.random.seed(SEED)
n_samples = 50
sample_1 = np.random.randn(n_samples)
sample_2 = np.random.randn(n_samples)+0.7 # effect size is 0.7

Sample comparison

Note that for a two-sample test, stambo does not require the statistic of choice to be a machine learning metric, i.e. a subclass of stambo.metrics.Metric. Also note that we have a non-paired design here (a regular two-sample test), so this needs to be specified explicitly.

[3]:
res = stambo.two_sample_test(sample_1, sample_2, statistics={"Mean": lambda x: x.mean()}, non_paired=True)
Bootstrapping: 100%|██████████| 10000/10000 [00:00<00:00, 58539.44it/s]
[4]:
res
[4]:
{'Mean': array([ 0.18418158,  0.23771615, -0.12154188,  0.58512178,  0.08127468,
        -0.16881557,  0.33611198,  0.31899082,  0.0732373 ,  0.56617054])}
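The statistics argument accepts any callable that maps a sample to a scalar. As a hypothetical variant (not run in this notebook), one could compare medians instead of means:

# Hypothetical example: any callable returning a scalar can serve as the statistic
res_median = stambo.two_sample_test(sample_1, sample_2, statistics={"Median": lambda x: np.median(x)}, non_paired=True)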

LaTeX report

[5]:
print(stambo.to_latex(res, m1_name="Control", m2_name="Treated"))
% \usepackage{booktabs} <-- do not forget to have this imported.
\begin{tabular}{ll} \\
\toprule
\textbf{Model} & \textbf{Mean} \\
\midrule
Control & $0.08$ [$-0.17$-$0.34$] \\
Treated & $0.32$ [$0.07$-$0.57$] \\
\midrule
Effect size & $0.24$ [$-0.12$-$0.59]$ \\
\midrule
$p$-value & $0.18$ \\
\bottomrule
\end{tabular}

This example shows that, for the data at hand, we do not have enough evidence to reject the null hypothesis.
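If you prefer to work with the raw numbers rather than the LaTeX table, the array stored under res["Mean"] holds the same quantities as the report above. The layout below is inferred by matching the array to the printed table, so please double-check it against stambo's documentation:

# Inferred layout of res["Mean"] (obtained by matching it to the report above):
# [p-value, effect size, effect size CI low/high,
#  mean of sample 1, its CI low/high, mean of sample 2, its CI low/high]
p_value, effect_size = res["Mean"][0], res["Mean"][1]
print(f"p-value: {p_value:.2f}, effect size: {effect_size:.2f}")  # 0.18 and 0.24 here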

Grouped data

A very common scenario to consider is grouped data: for example, when we have multiple datapoints from the same subjects.

Below is the simulation code that generates synthetic data for the case where we have multiple datapoints from the same subject. To check whether your data is grouped, you can plot its covariance matrix.

Here, we sample from a distribution of estimators of the covariance matrices to reveal the block structure. Here is a breakdown of the sampling process:

- M: Total number of subjects in the study.
- N: Total number of measurements across all subjects.
- dim: Dimensionality of each measurement (e.g., if measurements are scalar, dim=1).
- mean_range: The range from which each subject's mean measurement value is randomly sampled.
- base_cov: A covariance matrix that defines how measurements within the same subject may vary around the subject's mean.

The process of data simulation involves:

1. Generating a specified number of measurements for each subject using a multinomial distribution, ensuring each subject has at least one measurement.
2. Simulating the actual measurements for each subject based on their unique mean and a structured covariance matrix.

This is repeated 1000 times to average out random fluctuations and get a stable estimate of the covariance structure across subjects.

[6]:
# Example usage
M = 10  # Number of subjects
N = 100  # Total measurements
dim = 1  # Dimensionality of each measurement

mean_range = np.array((-5, 0))  # Each subject's mean is sampled from this range
base_cov = np.ones((dim, dim))  # Example structured covariance

res = 0
np.random.seed(SEED)
n_measurements_per_subj = stambo.synthetic.generate_n_measurements(M, N, random=True, d_temp=2)

# How many times to regenerate the dataset
n_samples_cov = 1000
for i in range(n_samples_cov):
    data, subject_ids, means = stambo.synthetic.generate_non_iid_measurements(n_measurements_per_subj, mean_range, base_cov)
    res += data @ data.T

res /= n_samples_cov
[7]:
plt.figure(figsize=(4, 4))
plt.imshow(res)
plt.title("Subject-Level Covariance Structure")
plt.show()

for s_id in range(M):
    n_per_i = data[subject_ids == s_id].shape[0]
    print(f"Data for subject {s_id}: {n_per_i}")
[Figure: Subject-Level Covariance Structure]
Data for subject 0: 5
Data for subject 1: 11
Data for subject 2: 4
Data for subject 3: 17
Data for subject 4: 18
Data for subject 5: 14
Data for subject 6: 12
Data for subject 7: 11
Data for subject 8: 1
Data for subject 9: 7

Grouped data & paired design

Let us now cover another common case: a number of subjects with multiple measurements taken before and after treatment, where we want to estimate the treatment effect. This case is closely related to comparing two models: Model 1 can be seen as the control, and an improved version of it (Model 2) as the treatment.

Here, we have developed functionality to generate data from a paired design. We can easily use stambo's API to specify the groups.

[8]:
paired_data, s_ids_paired = stambo.synthetic.generate_paired_data(n_subjects=3, n_measurements=40, effect_size=0.5, dim=1, mean_range=(0, 0.5), seed=SEED)
[9]:
plt.figure(figsize=(6, 2))
plt.hist(paired_data[:, 0], label="Control", color="blue", alpha=0.5)
plt.hist(paired_data[:, 1], label="Treatment", color="red", alpha=0.5)
plt.legend()
plt.show()

[Figure: Histograms of the Control and Treatment samples]

Let us first run the test without specifying the groups.

[10]:
res = stambo.two_sample_test(paired_data[:, 0], paired_data[:, 1], statistics={"Mean": lambda x: x.mean()}, seed=SEED)
print(stambo.to_latex(res, m1_name="Control", m2_name="Treatment"))
Bootstrapping: 100%|██████████| 10000/10000 [00:00<00:00, 50639.82it/s]
% \usepackage{booktabs} <-- do not forget to have this imported.
\begin{tabular}{ll} \\
\toprule
\textbf{Model} & \textbf{Mean} \\
\midrule
Control & $0.36$ [$-0.10$-$0.83$] \\
Treatment & $0.82$ [$0.46$-$1.21$] \\
\midrule
Effect size & $0.46$ [$-0.07$-$1.00]$ \\
\midrule
$p$-value & $0.09$ \\
\bottomrule
\end{tabular}

What we can see above is that the test does not detect an effect, and the 95% CI on the effect size is very wide.

[11]:
res = stambo.two_sample_test(paired_data[:, 0], paired_data[:, 1], statistics={"Mean": lambda x: x.mean()}, groups=s_ids_paired, seed=SEED)
# Let us add more digits to see the p-value clearer
print(stambo.to_latex(res, m1_name="Control", m2_name="Treatment", n_digits=4))
Bootstrapping: 100%|██████████| 10000/10000 [00:00<00:00, 57055.19it/s]
% \usepackage{booktabs} <-- do not forget to have this imported.
\begin{tabular}{ll} \\
\toprule
\textbf{Model} & \textbf{Mean} \\
\midrule
Control & $0.3640$ [$0.2471$-$0.5008$] \\
Treatment & $0.8246$ [$0.7465$-$0.9272$] \\
\midrule
Effect size & $0.4607$ [$0.3419$-$0.5721]$ \\
\midrule
$p$-value & $0.0002$ \\
\bottomrule
\end{tabular}

Once we have specified the groups, we get very tight 95% CIs around the true effect (0.5), showing that it is indeed positive. Moreover, we are able to reject the null, since our bootstrap procedure now closely resembles the data-generating process.
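To build intuition for what the groups argument changes, the sketch below shows a single replicate of a group-level (cluster) bootstrap: whole subjects are resampled, so all measurements from a subject stay together. This is only an illustration of the idea, not stambo's actual implementation, and it assumes s_ids_paired is a 1-D array of per-row subject IDs.

# Illustration only (not stambo's internal code): one cluster-bootstrap replicate.
rng = np.random.default_rng(SEED)
unique_ids = np.unique(s_ids_paired)
resampled_ids = rng.choice(unique_ids, size=unique_ids.shape[0], replace=True)
boot_idx = np.concatenate([np.where(s_ids_paired == s)[0] for s in resampled_ids])
# Effect size (Treatment minus Control) computed on the resampled subjects
boot_effect = paired_data[boot_idx, 1].mean() - paired_data[boot_idx, 0].mean()
print(f"One bootstrap replicate of the effect size: {boot_effect:.3f}")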