Reveal multiple level of biological heterogeneities in Mouse Hypothalamus by MERFISH-seq¶
In this tutorial, we will showcae the STEP for revealing multiple levels of biological heterogeneities in the Mouse Hypothalamus by MERFISH-seq. STEP has been designed to shed light on different aspects of cellular complexity: 1. Cell State Level Heterogeneity: STEP can capture unique cellular states based on genetic expressions in the mouse hypothalamus. 2. Spatial Domain Level Heterogeneity: STEP facilitates understanding of spatial patterns and relationships among cells by smoothing the cell state embeddings. 3. End-to-End (e2e) Mode for Direct Spatial Domain Identification: A streamlined process to directly identify and analyze spatial heterogeneity.
The aim of this tutorial is to guide users through the process of utilizing and understanding the capabilities of STEP, enabling users to dive deep into the biological intricacies of the mouse hypothalamus.
[1]:
import os
import anndata as ad
import scanpy as sc
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from step import scModel, stModel
sc.settings.figdir = './results/MERFISH'
sc.set_figure_params(dpi_save=300, figsize=(6, 4.5))
st_file = './data/merfish_animal1.h5ad'
/projects/82505004-e7a0-445f-ab3c-80d03c91438f/.cache/pypoetry/virtualenvs/step-Ajq_Bw_i-py3.10/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
Cell state level heterogeneity¶
For the identification of cell state level hetorogeneity single-cell resolution data, we use step.scModel
which will focus on the modeling of transcipts and elimination of the potential batch-effects (not presented in this data), ignoring the spatial context even though the data is spatially resovled.
Setup STEP object¶
Due to the limitation of the MERFISH technique, there’re only 161 genes across sections. Subsequently, we will use all these genes by setting the paramter n_top_genes=None
and set the parameter n_modules=20
to model the fewer gene modules.
[2]:
stepc = scModel(
file_path=st_file,
n_top_genes=None,
layer_key=None,
batch_key='Bregma',
class_key='Cell_class',
module_dim=30,
hidden_dim=64,
n_modules=20,
decoder_type='zinb',
n_dec_hid_layers=1,
)
Using all genes
Using given geneset
Adding count data to layer 'counts'
================Dataset Info================
Batch key: Bregma
Class key: Cell_class
Number of Batches: 5
Number of Classes: 15
Gene Expr: (28316, 161)
Batch Label: (28316,)
============================================
[3]:
stepc.adata.obs.head()
[3]:
Animal_ID | Animal_sex | Behavior | Bregma | Centroid_X | Centroid_Y | Cell_class | Neuron_cluster_ID | in_tissue | n_genes | |
---|---|---|---|---|---|---|---|---|---|---|
Cell_ID | ||||||||||
6c59b1ec-95df-4f25-882a-ac94d99b87d8 | 1 | Female | Naive | -0.04 | -3033.401378 | 2825.155877 | Endothelial 3 | NaN | True | 29 |
aa7ce410-c485-42c9-9c29-bc1d90a46e00 | 1 | Female | Naive | -0.04 | -3027.328655 | 2955.936449 | Inhibitory | I-5 | True | 61 |
8426329e-077b-4385-bd94-c9fdaeb89bd1 | 1 | Female | Naive | -0.04 | -3020.512953 | 2961.696710 | Inhibitory | I-5 | True | 54 |
a7e4e0de-8baa-4136-8cac-fa9bc759e7e8 | 1 | Female | Naive | -0.04 | -3017.397490 | 2995.915801 | Inhibitory | I-5 | True | 86 |
c55468d5-1b2c-4477-b4ad-3653f5bbda35 | 1 | Female | Naive | -0.04 | -3005.478613 | 2877.528929 | Endothelial 1 | NaN | True | 40 |
[4]:
stepc.run(epochs=400,
batch_size=2048,
split_rate=0.2,
tune_epochs=100,
beta1=0.1,
beta2=0.01,
)
================Dataset Info================
Batch key: Bregma
Class key: Cell_class
Number of Batches: 5
Number of Classes: 15
Gene Expr: (28316, 161)
Batch Label: (28316,)
============================================
Performing category random split
Training size for -0.24: 4434
Training size for -0.19: 4641
Training size for -0.14: 4740
Training size for -0.09: 4445
Training size for -0.04: 4390
train size: 22650
valid size: 5666
Current Mode: multi_batches: ['gene_expr', 'batch_label']
Current Mode: multi_batches: ['gene_expr', 'batch_label']
100%|██████████| 400/400 [05:15<00:00, 1.27epoch/s, kl_loss=2.391, recon_loss=160.094, val_kl_loss=2.473/2.288, val_recon_loss=154.405/153.951]
Current Mode: multi_batches: ['gene_expr', 'batch_label']
Current Mode: multi_batches: ['gene_expr', 'batch_label']
Current Mode: multi_batches_with_ct: ['gene_expr', 'class_label', 'batch_label']
================Dataset Info================
Batch key: Bregma
Class key: Cell_class
Number of Batches: 5
Number of Classes: 15
Gene Expr: (28316, 161)
Class Label: (28316,)
Batch Label: (28316,)
============================================
Performing category random split
Training size for Astrocyte: 2724
Training size for Endothelial 1: 1184
Training size for Endothelial 2: 115
Training size for Endothelial 3: 439
Training size for Ependymal: 1029
Training size for Excitatory: 4629
Training size for Inhibitory: 9895
Training size for Microglia: 475
Training size for OD Immature 1: 809
Training size for OD Immature 2: 25
Training size for OD Mature 1: 190
Training size for OD Mature 2: 922
Training size for OD Mature 3: 11
Training size for OD Mature 4: 37
Training size for Pericytes: 163
train size: 22647
valid size: 5669
Current Mode: multi_batches_with_ct: ['gene_expr', 'class_label', 'batch_label']
Current Mode: multi_batches_with_ct: ['gene_expr', 'class_label', 'batch_label']
32%|███▏ | 32/100 [00:31<01:04, 1.05epoch/s, cl_loss=0.036, kl_loss=0.009, val_cl_loss=0.317/0.233, val_kl_loss=0.010/0.010]Early Stopping triggered
32%|███▏ | 32/100 [00:31<01:06, 1.03epoch/s, cl_loss=0.036, kl_loss=0.009, val_cl_loss=0.317/0.233, val_kl_loss=0.010/0.010]
Current Mode: multi_batches: ['gene_expr', 'batch_label']
EarlyStopping counter: 20 out of 20
EarlyStopping counter: 2 out of 1
Current Mode: multi_batches_with_ct: ['gene_expr', 'class_label', 'batch_label']
Current Mode: multi_batches: ['gene_expr', 'batch_label']
Current Mode: multi_batches: ['gene_expr', 'batch_label']
Current Mode: multi_batches: ['gene_expr', 'batch_label']
[11]:
adata = stepc.adata
sc.pp.neighbors(adata, use_rep='X_rep')
sc.tl.umap(adata)
sc.pl.umap(adata, color=['Bregma', 'Cell_class'])
[12]:
sc.pp.neighbors(adata, use_rep='X_anchord')
sc.tl.umap(adata)
sc.pl.umap(adata, color=['Bregma', 'Cell_class'])
[15]:
from scib_metrics.benchmark import Benchmarker, BioConservation
biocons = BioConservation(nmi_ari_cluster_labels_leiden=True, nmi_ari_cluster_labels_kmeans=False)
# batcons = BatchCorrection(pcr_comparison=False)
bm = Benchmarker(
adata,
batch_key="Bregma",
label_key="Cell_class",
embedding_obsm_keys=["X_rep", "X_anchord", "X_pca_corrected", "X_pca"],
bio_conservation_metrics=biocons,
# batch_correction_metrics=batcons,
n_jobs=-1,
)
bm.benchmark()
bm.plot_results_table(min_max_scale=False)
Computing neighbors: 100%|██████████| 2/2 [00:09<00:00, 4.70s/it]
Embeddings: 0%| | 0/2 [00:00<?, ?it/s]
Metrics: 0%| | 0/10 [00:00<?, ?it/s]
Metrics: 0%| | 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]WARNING:jax._src.xla_bridge:An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
Metrics: 10%|█ | 1/10 [00:02<00:20, 2.30s/it, Bio conservation: isolated_labels]
Metrics: 10%|█ | 1/10 [00:02<00:20, 2.30s/it, Bio conservation: nmi_ari_cluster_labels_leiden]
Metrics: 20%|██ | 2/10 [00:10<00:46, 5.78s/it, Bio conservation: nmi_ari_cluster_labels_leiden]
Metrics: 20%|██ | 2/10 [00:10<00:46, 5.78s/it, Bio conservation: silhouette_label]
Metrics: 30%|███ | 3/10 [00:12<00:27, 3.94s/it, Bio conservation: silhouette_label]
Metrics: 30%|███ | 3/10 [00:12<00:27, 3.94s/it, Bio conservation: clisi_knn]
Metrics: 40%|████ | 4/10 [00:13<00:16, 2.69s/it, Bio conservation: clisi_knn]
Metrics: 40%|████ | 4/10 [00:13<00:16, 2.69s/it, Batch correction: silhouette_batch]
Metrics: 50%|█████ | 5/10 [00:19<00:19, 3.93s/it, Batch correction: silhouette_batch]
Metrics: 50%|█████ | 5/10 [00:19<00:19, 3.93s/it, Batch correction: ilisi_knn]
Metrics: 60%|██████ | 6/10 [00:19<00:10, 2.67s/it, Batch correction: ilisi_knn]
Metrics: 60%|██████ | 6/10 [00:19<00:10, 2.67s/it, Batch correction: kbet_per_label]
Metrics: 70%|███████ | 7/10 [00:35<00:21, 7.01s/it, Batch correction: kbet_per_label]
Metrics: 70%|███████ | 7/10 [00:35<00:21, 7.01s/it, Batch correction: graph_connectivity]
Metrics: 80%|████████ | 8/10 [00:35<00:14, 7.01s/it, Batch correction: pcr_comparison]
Embeddings: 50%|█████ | 1/2 [00:36<00:36, 36.03s/it]atch correction: pcr_comparison]
Metrics: 0%| | 0/10 [00:00<?, ?it/s]
Metrics: 0%| | 0/10 [00:00<?, ?it/s, Bio conservation: isolated_labels]
Metrics: 10%|█ | 1/10 [00:01<00:16, 1.83s/it, Bio conservation: isolated_labels]
Metrics: 10%|█ | 1/10 [00:01<00:16, 1.83s/it, Bio conservation: nmi_ari_cluster_labels_leiden]
Metrics: 20%|██ | 2/10 [00:11<00:50, 6.26s/it, Bio conservation: nmi_ari_cluster_labels_leiden]
Metrics: 20%|██ | 2/10 [00:11<00:50, 6.26s/it, Bio conservation: silhouette_label]
Metrics: 30%|███ | 3/10 [00:12<00:29, 4.17s/it, Bio conservation: silhouette_label]
Metrics: 30%|███ | 3/10 [00:12<00:29, 4.17s/it, Bio conservation: clisi_knn]
Metrics: 40%|████ | 4/10 [00:13<00:15, 2.60s/it, Bio conservation: clisi_knn]
Metrics: 40%|████ | 4/10 [00:13<00:15, 2.60s/it, Batch correction: silhouette_batch]
Metrics: 50%|█████ | 5/10 [00:13<00:09, 1.83s/it, Batch correction: silhouette_batch]
Metrics: 50%|█████ | 5/10 [00:13<00:09, 1.83s/it, Batch correction: ilisi_knn]
Metrics: 60%|██████ | 6/10 [00:13<00:05, 1.27s/it, Batch correction: ilisi_knn]
Metrics: 60%|██████ | 6/10 [00:13<00:05, 1.27s/it, Batch correction: kbet_per_label]
Metrics: 70%|███████ | 7/10 [00:30<00:18, 6.29s/it, Batch correction: kbet_per_label]
Metrics: 70%|███████ | 7/10 [00:30<00:18, 6.29s/it, Batch correction: graph_connectivity]
Embeddings: 100%|██████████| 2/2 [01:06<00:00, 33.24s/it]atch correction: pcr_comparison]
[15]:
<plottable.table.Table at 0x7fdb6c7edd20>
Spatial domain level heterogeneity by spatially smoothing the embedding in cell state level¶
[2]:
model_config = dict(
n_glayers=4,
max_neighbors=20,
variational=True,
)
run_config = dict(
epochs=400,
split_rate=0.2,
n_iterations=800,
batch_size=1024,
graph_batch_size=2,
tune_lr=1e-4,
beta=1e-4,
e2e=False,
)
[3]:
stepc = stModel(
file_path=st_file,
n_top_genes=None,
batch_key='Bregma',
filtered=True,
layer_key=None,
log_transformed=True,
coord_keys=['Centroid_X', 'Centroid_Y'],
module_dim=30,
hidden_dim=64,
n_modules=20,
decoder_type='zinb',
edge_clip=None,
dispersion='gene',
use_earlystop=True,
**model_config,
)
Using all genes
Using given geneset
Adding count data to layer 'counts'
Dataset Done
================Dataset Info================
Batch key: Bregma
Class key: None
Number of Batches: 5
Number of Classes: None
Gene Expr: (28316, 161)
Batch Label: (28316,)
============================================
[4]:
stepc.run(**run_config)
Training with 2 stages pattern: 1/2
Performing global random split
100%|██████████| 400/400 [03:48<00:00, 1.75epoch/s, kl_loss=0.880, recon_loss=146.231, val_kl_loss=0.884, val_recon_loss=147.476]
Training graph with multiple batches
Training with 2 stages pattern: 2/2
100%|██████████| 800/800 [07:45<00:00, 1.72epoch/s, kl_loss=0.012, recon_loss=212.722]
[5]:
adata = stepc.adata
sc.pp.neighbors(adata, use_rep='X_unsmoothed')
sc.tl.umap(adata)
sc.pl.umap(adata, color=['Bregma', 'Cell_class'])
[6]:
stepc.cluster(adata=adata, use_rep='X_smoothed', n_clusters=8)
sc.pp.neighbors(adata, use_rep='X_smoothed')
sc.tl.umap(adata)
sc.pl.umap(adata, color=['Bregma', 'Cell_class', 'domain'])
[12]:
stepc.spatial_plot(color='domain', with_images=False, size=40)
Title not provided, or length does not match the number of batches. Using feature names as title.
E2E mode for direct spatial domain identification¶
Model config & running config¶
First, set e2e=True
to enable the e2e mode. In e2e mode, these parameters used in the non e2e mode will be ignored: - epochs
- batch_size
- split_rate
Instead, a graph sampling strategy will be used to train the model.
[2]:
model_config = dict(
n_glayers=4,
max_neighbors=30,
variational=True,
)
run_config = dict(
n_iterations=2000,
graph_batch_size=1,
beta=1e-2,
e2e=True,
contrast=True,
)
Load data and add annotations of zones¶
Note: The annotations of zones is obtained from the tutorial of BASS
[ ]:
adata = sc.read_h5ad(st_file)
for batch in os.listdir('./data/merfish_anno/'):
anno = pd.read_csv(f'./data/merfish_anno/{batch}', index_col=0)
batch_id = batch[:-4][6:]
adata.obs.loc[adata.obs['Bregma'] == float(batch_id), 'annotation'] = anno['x'].values
Set up STEP object¶
[6]:
stepc = stModel(
adata=adata,
n_top_genes=None,
geneset_to_use=adata.var_names,
batch_key='Bregma',
filtered=True,
layer_key=None,
log_transformed=True,
coord_keys=['Centroid_X', 'Centroid_Y'],
module_dim=30,
hidden_dim=64,
n_modules=20,
decoder_type='zinb',
edge_clip=None,
**model_config,
)
Adding count data to layer 'counts'
Dataset Done
Modeling with 1 components
{'variational': True, 'input_dim': 161, 'hidden_dim': 64, 'module_dim': 30, 'decoder_input_dim': None, 'n_modules': 20, 'edge_clip': None, 'decoder_type': 'zinb', 'n_glayers': 4}
Generate spatially smoothed embeddings and cluster them to obtain spatial domains¶
[7]:
stepc.run(**run_config)
adata = stepc.adata
Training with e2e pattern
Training graph with multiple batches
100%|██████████| 2000/2000 [04:22<00:00, 7.60step/s, recon_loss=370.799, kl_loss=0.428, contrast_loss=0.011, val_recon_loss=-, val_kl_loss=-, val_contrast_loss=-]
[9]:
stepc.cluster(adata, use_rep='X_smoothed', n_clusters=8,)
Visualization¶
[10]:
for batch in adata.obs['Bregma'].unique():
_adata = adata[adata.obs['Bregma'] == batch]
plt.title(batch)
plt.scatter(x=_adata.obs['Centroid_X'],
y=_adata.obs['Centroid_Y'],
c=_adata.obs['domain'].values.astype(int),
s=5, cmap='tab20')
plt.show()
[11]:
adata.obs['annotation'] = adata.obs['annotation'].astype('category')
Metrics for evaluation of the obtained spatial domains compared to the annotations¶
[13]:
from sklearn.metrics import adjusted_rand_score
for batch in adata.obs['Bregma'].unique():
_adata = adata[adata.obs['Bregma'] == batch]
ari = adjusted_rand_score(_adata.obs['domain'].astype(int), _adata.obs['annotation'].cat.codes)
print(f"{batch}: {ari}")
-0.04: 0.5337230222364567
-0.09: 0.6226701393120998
-0.14: 0.49295132575539485
-0.19: 0.6502377274236062
-0.24: 0.6812873877486072