Univariate Model¶
This tutorial demonstrates how to use the functionalities implemented in SpectraNorm to fit a univariate normative model to a single feature (i.e. a direct normative model). The tutorial covers data preparation, model fitting, and basic interpretation of results.
Importing Libraries¶
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
# Import the SpectraNorm package
from spectranorm import snm
Data Preparation¶
Here, we will generate synthetic data to illustrate the process of fitting a univariate normative model. In practice, you would replace this with loading your actual dataset.
# Fix the random seed for reproducibility
np.random.seed(42)
sample_size = 5000
data = pd.DataFrame({
"age": np.random.uniform(20, 80, size=sample_size),
"site": np.random.choice(["A", "B", "C", "D", "E", "F"], size=sample_size),
})
# Generate a synthetic phenotype
# Let's assume the phenotype is nonlinearly related to age,
# and site effect comes from a normal distribution with mean depending
# on the site and some random noise
site_effects_prior = np.random.normal(0, 5, size=data["site"].nunique())
site_effects = data["site"].map(dict(zip(data["site"].unique(), site_effects_prior)))
phenotype_mean = 0.01 * (
400 + 1.5 * data["age"] -
0.01 * (data["age"] - 50)**3 +
10 * site_effects)
# Heteroscedasticity: variance increases with age
phenotype_std = 0.01 * (10 + 0.8 * data["age"])
data["phenotype"] = np.random.normal(phenotype_mean, phenotype_std)
data.head()
|   | age | site | phenotype |
|---|---|---|---|
| 0 | 42.472407 | C | 4.638801 |
| 1 | 77.042858 | C | 4.186396 |
| 2 | 63.919637 | F | 3.992819 |
| 3 | 55.919509 | D | 5.027764 |
| 4 | 29.361118 | C | 5.115786 |
The synthetic data above simulates a scenario where we have a phenotype that is influenced by age and site, with some added noise. The site variable is categorical, representing different data collection sites, while age is a continuous variable. The phenotype is the target variable we want to model.
The visualization below illustrates the relationship between age and the phenotype across different sites, showing how the phenotype varies with age and site.
plt.figure(figsize=(10, 6))
for site in sorted(data["site"].unique()):
subset = data[data["site"] == site]
plt.scatter(subset["age"], subset["phenotype"], alpha=0.5, label=f"{site}", s=7)
plt.xlabel("Age")
plt.ylabel("Phenotype")
plt.title("Phenotype vs Age by Site")
plt.legend(title="Site")
plt.show()
Minimal model fitting example¶
The example below demonstrates how to fit a univariate normative model using snm.DirectNormativeModel. We will use the synthetic data generated in the previous section.
We start with the simplest way of defining and fitting a univariate normative model, in which most of the fitting parameters are set to their default values. This is a good starting point for users who are new to normative modeling and want to quickly fit a model without delving into the details of the fitting process.
Defining the model¶
In this example, we define a univariate normative model directly from the training data. The DirectNormativeModel.from_dataframe method allows us to specify the target variable, covariates, and other parameters directly from a pandas DataFrame. This approach is straightforward and does not require manually constructing the model specification and priors.
# Define the normative model:
normative_model = snm.DirectNormativeModel.from_dataframe(
# Hierarchical Bayesian Regression (use a hierarchical prior on batch effects)
model_type="HBR",
# Pass the dataframe containing all covariates and the target variable of interest
dataframe=data,
# Specify the name of the target variable (the variable we want to model)
variable_of_interest="phenotype",
# Specify the numerical covariates to include in the model
numerical_covariates=["age"],
# Specify which numerical covariates are modeled as nonlinear effects (B-splines)
nonlinear_covariates=["age"],
# Specify the categorical covariates included in the model (including batch effects)
categorical_covariates=["site"],
# Specify whether any categorical covariates should be treated as batch effects
batch_covariates=["site"],
# Specify which covariates influence the mean of the target variable
influencing_mean=["age", "site"],
# Specify which covariates influence the variance of the target variable
influencing_variance=["age", "site"],
)
normative_model
DirectNormativeModel(spec=NormativeModelSpec(variable_of_interest='phenotype', covariates=[CovariateSpec(name=site, cov_type=categorical, hierarchical=True, n_categories=6), CovariateSpec(name=age, cov_type=numerical, effect=spline)], influencing_mean=['age', 'site'], influencing_variance=['age', 'site']))
Notably, the minimal definition above will fit the phenotype $y$ as a function of age and site, with the following model specification:
$$ y \sim \mathcal{N}(\mu, \sigma) $$ $$ \mu = f_{\mu}(\Phi(\text{age}), \text{site}) $$ $$ \sigma = f^+_{\sigma}(\Phi(\text{age}), \text{site}) $$
where $\Phi(\text{age})$ denotes the B-spline basis expansion of the age covariate, and $f_{\mu}$ and $f^+_{\sigma}$ model the mean and standard deviation of the phenotype as generalized additive functions of the covariates (the superscript $+$ indicating that the standard deviation is constrained to be positive).
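For intuition, the basis expansion $\Phi(\text{age})$ can be sketched with SciPy. This is only an illustration of a clamped cubic B-spline design matrix, not SpectraNorm's internal implementation; the knot placement here is an arbitrary choice:

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                     # cubic B-splines
knots = np.linspace(20, 80, 7)            # knot grid over the age range
t = np.r_[[20.0] * k, knots, [80.0] * k]  # clamp the boundary knots
age = np.linspace(20, 79, 100)            # evaluation points inside the range
# Each row of Phi holds the basis-function values for one age value
Phi = BSpline.design_matrix(age, t, k).toarray()
print(Phi.shape)        # (100, 9): one column per basis function
print(Phi.sum(axis=1))  # rows sum to 1 (partition of unity)
```

The mean function $f_{\mu}$ then becomes a linear combination of these basis columns, which is what lets a linear-in-parameters model capture the nonlinear age trend.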
Fitting the model¶
Now that the normative model has been defined, we can proceed to fit it to the data. The DirectNormativeModel.fit method uses the prespecified model instance to estimate the model parameters from the provided data and covariates.
# Fit the normative model to the training data
normative_model.fit(train_data=data)
Finished [100%]: Average Loss = 3,115.5
Examining the fitted model¶
Now that the normative model has been fitted, we can examine the results to understand how well the model captures the relationship between the phenotype and the covariates.
Visualizing the model fit¶
Here, we will look at the estimated normative ranges describing the trend by which the phenotype changes with age. Assuming that we are not interested in site effects, we can marginalize over the site covariate to obtain a single normative range that describes the expected phenotype values across age, regardless of site (i.e. partialing out the site effects). The plot below shows the estimated normative range with different shades indicating different centiles of the normative distribution.
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
# 100 evenly spaced age values for prediction
xi = np.linspace(20, 80, 100)
# create dummy dataframe for prediction
dummy_df = pd.DataFrame({
"age": xi,
})
# Predict moments (mean and std) of the normative distribution
moments = normative_model.predict(dummy_df, predict_without=['site']).predictions
# Let's compute charts along different quantiles
quantiles = [0.001, 0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99, 0.999]
# Predict the phenotype values corresponding to different quantiles
phenotype_quantiles = {
q: (moments["mu_estimate"] + stats.norm.ppf(q=q) * moments["std_estimate"])
for q in quantiles
}
# Some colors for the plot
c1 = "#448e9c" # brighter shade
c2 = "#038da6" # darker shade
# Plotting shaded ranges for different quantiles
for q1, q2, alpha in [
(0.001, 0.999, 0.05),
(0.01, 0.99, 0.1),
(0.05, 0.95, 0.2),
(0.25, 0.75, 0.3)]:
ax.fill_between(
xi,
phenotype_quantiles[q1],
phenotype_quantiles[q2],
alpha=alpha,
color=c1,
)
# Now add boundary lines with different colors
ax.plot(xi, phenotype_quantiles[q1], color=c1, lw=0.5, linestyle=":")
ax.plot(xi, phenotype_quantiles[q2], color=c1, lw=0.5, linestyle=":")
# Finally, add the median line with a different color and higher line width
ax.plot(
xi,
phenotype_quantiles[0.5],
c=c2,
lw=4,
)
plt.xlabel("Age")
plt.ylabel("Phenotype")
plt.title("Phenotype vs Age");
The plot shows how the expected phenotype values change with age, along with the variability around the expected values. Visually, the fitted normative model looks reasonable, as it captures the general trend of the data.
Notably, given that this is synthetic data, for which we know the ground truth distribution, we can also compare the estimated normative range to the true underlying distribution of the data. This allows us to directly assess how well the fitted model captures the true (synthetic) relationship between age and the phenotype.
While this validation step is not possible with real data, it is a useful exercise to understand the capabilities and limitations of the normative modeling approach using synthetic data, where we have access to the ground truth distribution.
# A figure with two subplots:
# one showing the fitted normative model with shaded quantile ranges, and
# another showing the true underlying distribution of the synthetic data for comparison
fig, axes = plt.subplots(1, 2, figsize=(18, 8))
# First axis: Fitted normative model with shaded quantile ranges
ax = axes[0]
# 100 evenly spaced age values for prediction
xi = np.linspace(20, 80, 100)
# create dummy dataframe for prediction
dummy_df = pd.DataFrame({
"age": xi,
})
# Predict moments (mean and std) of the normative distribution
moments = normative_model.predict(dummy_df, predict_without=['site']).predictions
# Let's compute charts along different quantiles
quantiles = [0.001, 0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99, 0.999]
# Predict the phenotype values corresponding to different quantiles
phenotype_quantiles = {
q: (moments["mu_estimate"] + stats.norm.ppf(q=q) * moments["std_estimate"])
for q in quantiles
}
# Some colors for the plot
c1 = "#448e9c" # brighter shade
c2 = "#038da6" # darker shade
# Plotting shaded ranges for different quantiles
for q1, q2, alpha in [
(0.001, 0.999, 0.05),
(0.01, 0.99, 0.1),
(0.05, 0.95, 0.2),
(0.25, 0.75, 0.3)]:
ax.fill_between(
xi,
phenotype_quantiles[q1],
phenotype_quantiles[q2],
alpha=alpha,
color=c1,
)
# Now add boundary lines with different colors
ax.plot(xi, phenotype_quantiles[q1], color=c1, lw=0.5, linestyle=":")
ax.plot(xi, phenotype_quantiles[q2], color=c1, lw=0.5, linestyle=":")
# Finally, add the median line with a different color and higher line width
ax.plot(
xi,
phenotype_quantiles[0.5],
c=c2,
lw=4,
)
ax.set_xlabel("Age")
ax.set_ylabel("Phenotype")
ax.set_title("Phenotype vs Age (Fitted Normative Model)")
# Second axis: True underlying distribution of the synthetic data
# (after disregarding site effects)
ax = axes[1]
synthetic_mean = 0.01 * (
400 + 1.5 * xi -
0.01 * (xi - 50)**3)
synthetic_std = 0.01 * (10 + 0.8 * xi)
# Compute true quantiles of the synthetic distribution
synthetic_quantiles = {
q: (synthetic_mean + stats.norm.ppf(q=q) * synthetic_std)
for q in quantiles
}
# Some colors for the plot
c1 = "#609c44" # brighter shade
c2 = "#0ea603" # darker shade
# Plotting shaded ranges for different quantiles
for q1, q2, alpha in [
(0.001, 0.999, 0.05),
(0.01, 0.99, 0.1),
(0.05, 0.95, 0.2),
(0.25, 0.75, 0.3)]:
ax.fill_between(
xi,
synthetic_quantiles[q1],
synthetic_quantiles[q2],
alpha=alpha,
color=c1,
)
# Now add boundary lines with different colors
ax.plot(xi, synthetic_quantiles[q1], color=c1, lw=0.5, linestyle=":")
ax.plot(xi, synthetic_quantiles[q2], color=c1, lw=0.5, linestyle=":")
# Finally, add the median line with a different color and higher line width
ax.plot(
xi,
synthetic_quantiles[0.5],
c=c2,
lw=4,
)
ax.set_xlabel("Age")
ax.set_ylabel("Phenotype")
ax.set_title("Phenotype vs Age (ground truth)")
plt.tight_layout()
As we can see from the plot, the estimated normative range accurately captures the true underlying distribution of the data, indicating that the fitted model is able to capture the relationship between age and the phenotype effectively.
Quantitative evaluation of model fit¶
Beyond visual inspection, we can also evaluate the model fit quantitatively by examining the estimated predictions and how well the model captures the variability in the training data. This can be done either with the DirectNormativeModel.predict and NormativePredictions.evaluate_predictions methods, or directly with the DirectNormativeModel.evaluate method:
# First, we'll predict the normative ranges for the training data:
# Note that the predict method will only require the covariates as input
train_predictions = normative_model.predict(data[["age", "site"]])
# Now we can evaluate how well the model captures the variability in the target variable
train_predictions.evaluate_predictions(
variable_of_interest=data["phenotype"].to_numpy(),
train_mean=normative_model.model_params["mean_VOI"],
train_std=normative_model.model_params["std_VOI"],
n_params=normative_model.model_params["n_params"],
).evaluations
{'MAE': array(0.39788493),
'MSE': array(0.26895753),
'RMSE': array(0.51861116),
'MAPE': array(10.17445066),
'R-squared': array(0.77135881),
'Explained Variance': array(0.77135901),
'MSLL': array(-0.78767534)}
# Alternatively, we can simply use the model's built-in method to evaluate
# goodness-of-fit with a one-liner (note that the target variable is passed to
# the evaluate method):
normative_model.evaluate(data).evaluations
{'MAE': array(0.39788493),
'MSE': array(0.26895753),
'RMSE': array(0.51861116),
'MAPE': array(10.17445066),
'R-squared': array(0.77135881),
'Explained Variance': array(0.77135901),
'MSLL': array(-0.78767534)}
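As a conceptual aside, the MSLL (mean standardized log loss) reported above compares the model's Gaussian log predictive density against a trivial baseline Gaussian fitted to the training data (more negative is better). The sketch below illustrates that computation on toy data; it is not SpectraNorm's exact implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y_true = rng.normal(5.0, 1.0, size=1000)   # observed values
mu_pred = np.full_like(y_true, 5.0)        # model's predicted means
std_pred = np.full_like(y_true, 1.0)       # model's predicted stds
# Negative log predictive density under the model
nlpd_model = -stats.norm.logpdf(y_true, mu_pred, std_pred)
# Baseline: a single Gaussian fitted to the (training) data
nlpd_base = -stats.norm.logpdf(y_true, y_true.mean(), y_true.std())
msll = float(np.mean(nlpd_model - nlpd_base))
print(round(msll, 3))  # near 0 here: the model matches the trivial baseline
```

In this toy case the "model" predicts the true generating parameters, but so does the baseline, hence an MSLL near zero; the clearly negative MSLL above indicates the fitted normative model beats the baseline by modeling age- and site-dependent mean and variance.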
Extending the model to unseen data¶
Now that we have fitted the model to the training data, we can also apply the fitted model to new, unseen data to obtain predictions and evaluate how well the model generalizes. This is an important step in normative modeling, as it allows us to assess the model's performance on data that was not used during training.
This helps us verify whether the model has overfitted to the training data or whether it captures generalizable patterns that transfer to new data.
Let's generate some new synthetic data that was not used during training, but follows the same underlying distribution. We can then use the fitted model to make predictions on this new data and evaluate the results.
# Fix the random seed for reproducibility
np.random.seed(42)
test_sample_size = 5000
test_data = pd.DataFrame({
"age": np.random.uniform(20, 80, size=test_sample_size),
"site": np.random.choice(["A", "B", "C", "D", "E", "F"], size=test_sample_size),
})
# Generate the synthetic phenotype
# We'll use the same underlying distribution as before
test_site_effects = test_data["site"].map(
dict(zip(data["site"].unique(), site_effects_prior)))
test_phenotype_mean = 0.01 * (
400 + 1.5 * test_data["age"] -
0.01 * (test_data["age"] - 50)**3
+ 10 * test_site_effects)
test_phenotype_std = 0.01 * (10 + 0.8 * test_data["age"])
test_data["phenotype"] = np.random.normal(test_phenotype_mean, test_phenotype_std)
Quantitative evaluation of out-of-sample predictions¶
Similar to the evaluation of in-sample predictions, we can also evaluate the out-of-sample predictions quantitatively by looking at the estimated parameters and how well the model captures the variability in the new data. This can be done by examining the estimated predictions on the new data and comparing them to the actual values.
# We'll use the evaluate method for brevity here,
# but we can also break it down to prediction + evaluation
# similar to the training data if we want
test_predictions = normative_model.evaluate(test_data)
test_predictions.evaluations
{'MAE': array(0.3962901),
'MSE': array(0.26426399),
'RMSE': array(0.51406614),
'MAPE': array(9.63360171),
'R-squared': array(0.77598759),
'Explained Variance': array(0.77599224),
'MSLL': array(-0.79278783)}
As the evaluation results show, the model captures the general trend in the new data: it explains a substantial portion of the variance (roughly 77%), and the metrics are largely consistent with the in-sample evaluation, suggesting that the model generalizes well to new data with minimal overfitting.
Visualizing out-of-sample predictions (with harmonization)¶
Notably, SpectraNorm also provides the capability to harmonize the target variable by removing the estimated batch effects (here, site effects) from the phenotype values. This allows us to visualize the relationship between age and the phenotype without the confounding influence of site, which is particularly useful when we are interested in the underlying age-phenotype relationship across sites.
The script below demonstrates how to compute the harmonized phenotype values by removing the estimated site effects from the observed phenotype values:
# Let's first compute the harmonized phenotype values by removing the estimated
# site effects from the observed phenotype values
test_data["harmonized_phenotype"] = normative_model.harmonize(
# The dataframe containing the covariates and the target variable
data=test_data,
# The list of covariates to harmonize (partial out)
covariates_to_harmonize=["site"],
)
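Conceptually, harmonization subtracts batch-related offsets from the observations. The toy sketch below illustrates the idea with plain pandas; SpectraNorm's harmonize method uses the model's estimated site effects rather than raw group means, so this is only a simplified analogue:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "site": rng.choice(["A", "B"], size=200),
    "y": rng.normal(0.0, 1.0, size=200),
})
toy.loc[toy["site"] == "B", "y"] += 2.0  # inject a known site effect
# Estimate per-site offsets relative to the grand mean and subtract them
offsets = toy.groupby("site")["y"].mean() - toy["y"].mean()
toy["y_harmonized"] = toy["y"] - toy["site"].map(offsets)
# Site means of the harmonized values now coincide at the grand mean
print(toy.groupby("site")["y_harmonized"].mean())
```

A model-based approach such as SpectraNorm's improves on this by separating site offsets from covariate effects (e.g. a site that happens to contain older participants is not over-corrected).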
Overlaying the harmonized phenotype values as scatter points on the estimated normative range shows how the new data points relate to the normative model once site effects are accounted for. This reveals whether the new data points fall within the expected normative range after removing site-related variability, and can help identify outliers or deviations from the normative pattern that merit further investigation.
# The first part of the script is copied from the previous visualization section
# this part visualizes the normative range of the phenotype values across age.
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
# 100 evenly spaced age values for prediction
xi = np.linspace(20, 80, 100)
# create dummy dataframe for prediction
dummy_df = pd.DataFrame({
"age": xi,
})
# Predict moments (mean and std) of the normative distribution
moments = normative_model.predict(dummy_df, predict_without=['site']).predictions
# Let's compute charts along different quantiles
quantiles = [0.001, 0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99, 0.999]
# Predict the phenotype values corresponding to different quantiles
phenotype_quantiles = {
q: (moments["mu_estimate"] + stats.norm.ppf(q=q) * moments["std_estimate"])
for q in quantiles
}
# Some colors for the plot
c1 = "#448e9c" # brighter shade
c2 = "#038da6" # darker shade
# Plotting shaded ranges for different quantiles
for q1, q2, alpha in [
(0.001, 0.999, 0.05),
(0.01, 0.99, 0.1),
(0.05, 0.95, 0.2),
(0.25, 0.75, 0.3)]:
ax.fill_between(
xi,
phenotype_quantiles[q1],
phenotype_quantiles[q2],
alpha=alpha,
color=c1,
)
# Now add boundary lines with different colors
ax.plot(xi, phenotype_quantiles[q1], color=c1, lw=0.5, linestyle=":")
ax.plot(xi, phenotype_quantiles[q2], color=c1, lw=0.5, linestyle=":")
# Finally, add the median line with a different color and higher line width
ax.plot(
xi,
phenotype_quantiles[0.5],
c=c2,
lw=4,
)
# In addition to the part above, the following add the harmonized phenotype values
# as scatter points along the estimated normative range:
for site in sorted(test_data["site"].unique()):
subset = test_data[test_data["site"] == site]
plt.scatter(
subset["age"], subset["harmonized_phenotype"], alpha=0.5, label=f"{site}", s=7)
plt.xlabel("Age")
plt.ylabel("Phenotype")
plt.title("Phenotype vs Age by Site")
plt.legend(title="Site")
plt.show()
As seen in the plot, the harmonized phenotype values (i.e. after removing site effects) show a clearer relationship with age, and we can see how the new data points relate to the normative range without the confounding influence of site effects. For instance, note that the earlier scatter plot (before harmonization) indicated a clear site effect, with certain sites having visibly higher or lower phenotype values. After harmonization, this site effect is removed, allowing us to see the underlying relationship between age and the phenotype more clearly.
Concluding remarks¶
In summary, this tutorial demonstrates how to fit a univariate normative model using SpectraNorm, evaluate the model fit both in-sample and out-of-sample, and visualize the results while accounting for batch effects. This provides an introductory overview of the process of normative modeling and how to interpret the results effectively.