EntryPoints¶
STEP follows the workflows of AnnData and Scanpy, and provides three major entry points for users to interact with the package and analyze their data. Each corresponding Python object takes the data, the data-processing parameters, and the model configuration as inputs, and performs data preprocessing, dataset construction, and functional model initialization.
As a result, no further preprocessing is needed except gene selection and cell filtering.
More details about the usage of the entry points can be found in the API reference.
1. step.scModel¶
Generates embeddings for scRNA-seq or SRT data without incorporating spatial information. The following example shows how to handle an AnnData object adata.
Two important parameters are keys in adata.obs:
batch_key: name of the key in adata.obs that stores the batch indicator for each cell/spot. This information is used for batch correction. Use None for single-batch data.
class_key: name of the key in adata.obs that stores the cell-type/cell-state annotation for each cell/spot. If provided, this information is used to refine the embedding; otherwise the model runs in unsupervised mode.
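The two keys can be checked before the model is constructed. The helper below is a minimal sketch and not part of STEP: `check_obs_keys` is a hypothetical name, and it only inspects column names such as those returned by `adata.obs.columns`.

```python
def check_obs_keys(obs_columns, batch_key=None, class_key=None):
    """Validate optional adata.obs keys and report which modes they imply.

    Hypothetical helper for illustration; not part of the STEP API.
    """
    columns = set(obs_columns)
    for name, key in (("batch_key", batch_key), ("class_key", class_key)):
        if key is not None and key not in columns:
            raise KeyError(f"{name}={key!r} not found in adata.obs")
    return {
        "batch_correction": batch_key is not None,  # None means single-batch data
        "supervised": class_key is not None,        # None means unsupervised mode
    }


# Example with plain column names; in practice pass adata.obs.columns.
modes = check_obs_keys(["batch", "celltype", "n_genes"], batch_key="batch", class_key=None)
print(modes)
```

Running such a check first surfaces a misspelled key as a clear `KeyError` rather than a failure later during dataset construction.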
Embedding generation¶
from step import scModel
stepc = scModel(
    adata,
    n_top_genes=2000,
    geneset_to_use=None,  # or a predefined gene list
    batch_key='batch',  # None for single-batch data
    class_key='celltype',  # None for unsupervised mode
)
stepc.run(
    epochs=400,
    batch_size=1024,
    split_rate=0.2,
)
Save the model config and weights¶
stepc.save('scmodel_config')
This will create a folder named scmodel_config in the current working directory, containing the config.json and model.pth files.
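A quick way to confirm that the save step produced both files is sketched below. It uses only the standard library; `missing_saved_files` is a hypothetical helper, and the file names are taken from the description above, so adjust them if your version of STEP writes different ones.

```python
from pathlib import Path

# File names taken from the description above; adjust if your version differs.
EXPECTED_FILES = ("config.json", "model.pth")


def missing_saved_files(folder):
    """Return the expected files that are absent from a saved-model folder."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


missing = missing_saved_files("scmodel_config")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Saved model folder looks complete.")
```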
Do downstream analysis with scanpy workflow¶
import scanpy as sc

adata = stepc.adata
# 'X_rep' is the embedding generated by STEP in unsupervised mode
sc.pp.neighbors(adata, use_rep='X_rep')
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['celltype', 'batch', 'leiden'])
# 'X_anchord' is the embedding refined from 'X_rep' using the cell-type annotations
sc.pp.neighbors(adata, use_rep='X_anchord')
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['celltype', 'batch', 'leiden'])
2. step.stModel¶
Generates embeddings for SRT data with spatial information. STEP does not rely on contiguity between sections, so simply pass the AnnData object adata with the section identifiers stored in adata.obs, and give the name of that column as batch_key.
The parameters that differ from scModel are those related to the spatial neighbor graph:
coord_keys: keys in adata.obs that store the spatial coordinates of each spot/cell. The default is ('array_row', 'array_col') for Visium data.
edge_clip: cut-off distance between two spots, measured in grid indices rather than in pixels. Use 2 for Visium; 1 for Visium HD, ST, or Stereo-seq bins; None to activate kNN-based graph construction.
max_neighs: maximum number of neighbors for each cell when using kNN, intended for image-segmentation-based SRT data. Neighbors are computed with Euclidean distance.
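The difference between the two graph-construction modes can be illustrated with a small, self-contained sketch. This is not STEP's implementation: `grid_edges` and `knn_edges` are hypothetical helpers, and measuring the `edge_clip` cut-off as a Euclidean distance between grid indices is an assumption made here for illustration.

```python
import math


def grid_edges(coords, edge_clip):
    """Link spots whose grid-index distance is at most edge_clip (assumed Euclidean)."""
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= edge_clip:
                edges.add((i, j))
    return edges


def knn_edges(coords, max_neighs):
    """Link each cell to its max_neighs nearest cells by Euclidean distance."""
    edges = set()
    for i, point in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        others.sort(key=lambda j: math.dist(point, coords[j]))
        edges.update((i, j) for j in others[:max_neighs])
    return edges


# A 3 x 3 square grid of (row, col) indices, as for ST or binned data.
grid = [(r, c) for r in range(3) for c in range(3)]
center = grid.index((1, 1))

clipped = grid_edges(grid, edge_clip=1)
print(sum(center in edge for edge in clipped))  # neighbors of the center spot

knn = knn_edges(grid, max_neighs=2)
print(sum(1 for i, _ in knn if i == center))  # outgoing kNN links of the center
```

With a cut-off, the number of neighbors follows from the grid layout; with kNN, every cell gets the same number of outgoing links regardless of how densely cells are packed, which suits irregular, segmentation-based coordinates.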
By default, stModel runs in end-to-end mode (the ‘fast mode’ in the paper), which means the model is trained on the spatial graph directly. To train the model on the gene expression data first and then smooth the embedding with the spatial information, set e2e=False.
Embedding generation¶
from step import stModel
stepc = stModel(
    adata,
    n_top_genes=2000,
    batch_key='library_id',  # None for single-section data
    coord_keys=('array_row', 'array_col'),
    geneset_to_use=None,  # or a predefined gene list
    edge_clip=2,
    max_neighs=30,
)
stepc.run(
    e2e=True,
    n_iterations=2000,
    n_samples=2048,  # number of sampled nodes for each iteration
    graph_batch_size=2,  # when multiple sections are available
)
Save the model config and weights¶
stepc.save('stmodel_config')
Identify spatial domains and sub-domains¶
stepc.cluster(n_clusters=10)
stepc.sub_cluster(n_clusters=3, pre_key='domain')  # identify 3 sub-domains for each identified domain
Visualize with the wrapped scanpy plotting functions when the data contain multiple sections¶
stepc.spatial_plot(
    color='domain',
    with_images=True,  # show the images of the sections if available
    **kwargs,
)
# Perform downstream analysis with the scanpy workflow
import scanpy as sc

adata = stepc.adata
# 'X_smoothed' is the embedding generated by STEP
sc.pp.neighbors(adata, use_rep='X_smoothed')
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['domain', 'leiden', 'library_id'])
3. step.crossModel¶
Integrates scRNA-seq and SRT data.
The following parameters differ from those of scModel and stModel:
st_adata: the AnnData object for the SRT data.
st_batch_key: name of the key in st_adata.obs that stores the batch indicator for each spot. This information is used for batch correction.
sc_adata: the AnnData object for the scRNA-seq data.
Co-embedding¶
from step import crossModel
stepc = crossModel(
    sc_adata=sc_adata,
    st_adata=st_adata,
    n_top_genes=2000,
    batch_key='batch',
    class_key='celltype',
    st_batch_key='library_id',
)
stepc.integrate(
    epochs=400,
    batch_size=1024,
    split_rate=0.2,
    need_anchors=False,  # whether to use cell-type annotations to postprocess the co-embedding
)
# Save the model config and weights
stepc.save('crossmodel_config')
After co-embedding, the SRT part can be further processed by STEP’s spatial model to identify spatial domains, but this is recommended only for single-section data: when the spatial model is trained on multiple sections, the batch effect appears to be brought back by section-dependent spatial bias. For multi-section data, we therefore recommend using step.stModel to identify spatial domains.
Smooth the embedding of SRT data¶
stepc.generate_domains(
    use_st_decoder=False,  # whether to train a new decoder for the SRT data
    epochs=800,  # epochs for training the decoder
    batch_size=1024,  # batch size for training the decoder
    smooth_epochs=800,  # epochs for smoothing the embedding by training a spatial model
)
Domain-wise Cell-type Deconvolution¶
This process infers the cell-type composition of each spot in SRT data (non-single-cell resolution). Any available cell-type annotations stored in sc_adata.obs can be passed via the cell_type_key argument; likewise, any spatial-domain annotations can be passed via the domain_key argument to support the deconvolution. Passing sub_domain is recommended to match sub_types.
stepc.deconv(
    epochs=1500,
    domain_key='domain',
    cell_type_key='celltype',
)
Visualizing the results:
celltypes = stepc.st_adata.obsm['deconv'].columns
stepc.st_adata.obs = stepc.st_adata.obsm['deconv']
stepc.spatial_plot(color=celltypes)
# some cell types in section 'A'
stepc.spatial_plot(
    color=celltypes[:4],
    library_id='A',
)
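Beyond plotting, the inferred proportions can be summarized per spot. The sketch below is not part of STEP: it assumes each spot's deconvolution result is available as a mapping from cell type to proportion, uses made-up values for illustration, and picks the cell type with the largest proportion. `dominant_cell_type` is a hypothetical helper name.

```python
def dominant_cell_type(proportions):
    """Return the cell type with the largest proportion for one spot."""
    return max(proportions, key=proportions.get)


# Made-up proportions for two spots, for illustration only.
spots = {
    "spot_1": {"T cell": 0.6, "B cell": 0.3, "Fibroblast": 0.1},
    "spot_2": {"T cell": 0.2, "B cell": 0.1, "Fibroblast": 0.7},
}
labels = {spot: dominant_cell_type(props) for spot, props in spots.items()}
print(labels)
```

If the proportions are held in a pandas DataFrame, as the `.columns` access above suggests, `DataFrame.idxmax(axis=1)` yields the same per-spot label.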