不知不觉,时光飞逝,跟曾老师学习单细胞已经有一段时间,在学习使用软件进行单细胞数据分析时,发现通过对比seurat学习,学习scanpy会比较丝滑,总结了如下的10个实用小技巧。
我们通过加载各自的内置数据集来进行技巧演示:
加载实例数据
seurat:
library(Seurat)
#加载实例数据
data('pbmc_small')
#查看Seurat对象
print(pbmc_small)
结果:
An object of class Seurat
230 features across 80 samples within 1 assay
Active assay: RNA (230 features, 20 variable features)
3 layers present: counts, data, scale.data
2 dimensional reductions calculated: pca, tsne
注意:使用的是seurat V5版本,但是发现读取内置的数据集的seurat对象的RNA不是一个Formal class 'Assay5' ;一般的,使用seurat V5版本创建的seurat对象的RNA是一个Formal class 'Assay5' 。
scanpy:
import scanpy as sc
#加载实例数据
adata=sc.datasets.pbmc68k_reduced()
#查看scanpy对象
print(adata)
结果:
AnnData object with n_obs × n_vars = 700 × 765
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
1.查看barcodes的名称
seurat:
print(colnames(pbmc_small))
结果:
[1] "ATGCCAGAACGACT" "CATGGCCTGTGCAT" "GAACCTGATGAACC" "TGACTGGATTCTCA" "AGTCAGACTGCACA"
[6] "TCTGATACACGTGT" "TGGTATCTAAACAG" "GCAGCTCTGTTTCT" "GATATAACACGCAT" "AATGTTGACAGTCA"
[11] "AGGTCATGAGTGTC" "AGAGATGATCTCGC" "GGGTAACTCTAGTG" "CATGAGACACGGGA" "TACGCCACTCCGAA"
[16] "CTAAACCTGTGCAT" "GTAAGCACTCATTC" "TTGGTACTGAATCC" "CATCATACGGAGCA" "TACATCACGCTAAC"
............
............
scanpy:
print(adata.obs_names)
结果:
['AAAGCCTGGCTAAC-1', 'AAATTCGATGCACA-1', 'AACACGTGGTCTTT-1',
'AAGTGCACGTGCTA-1', 'ACACGAACGGAGTG-1', 'ACAGTTCTTAGCCA-1',
'ACATTCTGACTACG-1', 'ACCCTCGAGTGAGG-1', 'ACTGGCCTTTTCGT-1',
'ACTTGGGAACCAGT-1',
...
'TCTGATACGGTCTA-8', 'TGACGCCTCTCAGA-8', 'TGACTTACGGTCTA-8',
'TGCACAGATGCATG-8', 'TGGAGACTAGTAGA-8', 'TGGCACCTCCAACA-8',
'TGTGAGTGCTTTAC-8', 'TGTTACTGGCGATT-8', 'TTCAGTACCGGGAA-8',
'TTGAGGTGGAGAGC-8']
2.查看barcodes的meta_data
seurat:
print(pbmc_small@meta.data)
结果:
orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents
ATGCCAGAACGACT SeuratProject 70 47 0 A
CATGGCCTGTGCAT SeuratProject 85 52 0 A
GAACCTGATGAACC SeuratProject 87 50 1 B
TGACTGGATTCTCA SeuratProject 127 56 0 A
................ ... ... ... ...
................ ... ... ... ...
GAGTTGTGGTAGCT SeuratProject 527 47 0 A
GACGCTCTCTCTCG SeuratProject 202 30 0 A
AGTCTTACTTCGGA SeuratProject 157 29 0 A
GGAACACTTCAGAC SeuratProject 150 30 0 A
CTTGATTGATCTTC SeuratProject 233 76 1 B
scanpy:
print(adata.obs)
结果:
bulk_labels n_genes ... phase louvain
index ...
AAAGCCTGGCTAAC-1 CD14+ Monocyte 1003 ... G1 1
AAATTCGATGCACA-1 Dendritic 1080 ... S 1
AACACGTGGTCTTT-1 CD56+ NK 1228 ... G1 3
AAGTGCACGTGCTA-1 CD4+/CD25 T Reg 1007 ... G2M 9
ACACGAACGGAGTG-1 Dendritic 1178 ... G1 2
... ... ... ... ...
TGGCACCTCCAACA-8 Dendritic 1166 ... G1 2
TGTGAGTGCTTTAC-8 Dendritic 1014 ... G1 1
TGTTACTGGCGATT-8 CD4+/CD25 T Reg 1079 ... S 0
TTCAGTACCGGGAA-8 CD19+ B 1030 ... S 4
TTGAGGTGGAGAGC-8 Dendritic 1552 ... G1 2
3.查看features的名称
seurat:
print(rownames(pbmc_small))
结果:
[1] "MS4A1" "CD79B" "CD79A" "HLA-DRA"
[5] "TCL1A" "HLA-DQB1" "HVCN1" "HLA-DMB"
[9] "LTB" "LINC00926" "FCER2" "SP100"
[13] "NCF1" "PPP3CC" "EAF2" "PPAPDC1B"
... ...
... ...
[217] "KLRD1" "ARHGDIA" "IL2RB" "CLIC3"
[221] "PPP1R18" "CD247" "ALOX5AP" "XCL2"
[225] "C12orf75" "RARRES3" "PCMT1" "LAMP1"
[229] "SPON2" "S100B"
scanpy:
print(adata.var_names)
结果:
['HES4', 'TNFRSF4', 'SSU72', 'PARK7', 'RBP7', 'SRM', 'MAD2L2', 'AGTRAP',
'TNFRSF1B', 'EFHD2',
...
'ATP5O', 'MRPS6', 'TTC3', 'U2AF1', 'CSTB', 'SUMO3', 'ITGB2', 'S100B',
'PRMT2', 'MT-ND3']
4.查看核糖体基因
seurat:
genes=rownames(pbmc_small)
print(genes[grep('^RP[SL]',genes)])
结果:
[1] "RPL7L1"
scanpy:
ribo=adata.var_names.str.match('^RP[SL]')
print(adata.var_names[ribo])
结果:
#和Seurat使用的实例数据集不一样,得到的结果也不一样
['RPL39', 'RPS24', 'RPL36AL', 'RPS27L']
5.查看barcodes的数量:
seurat:
print(ncol(pbmc_small))
结果:
[1] 80
scanpy:
print(len(adata.obs_names))
结果:
700
6.查看features的数量:
seurat:
print(nrow(pbmc_small))
结果:
[1] 230
scanpy:
print(len(adata.var_names))
结果:
765
7.取子集:
seurat:
第一种技巧:
#选取nFeature_RNA<50的子集
sub_pbmc_small=subset(pbmc_small,subset = nFeature_RNA<50)
#查看子集
print(sub_pbmc_small)
结果:
An object of class Seurat
230 features across 36 samples within 1 assay
Active assay: RNA (230 features, 20 variable features)
3 layers present: counts, data, scale.data
2 dimensional reductions calculated: pca, tsne
第二种技巧:
#选取nFeature_RNA<50并且只含有核糖体基因的子集
genes=rownames(pbmc_small)
ribo=genes[grep('^RP[SL]',genes)]
cells=WhichCells(pbmc_small,expression=nFeature_RNA<50)
sub_pbmc_small=subset(pbmc_small,features=ribo,cells=cells)
#查看子集
print(sub_pbmc_small)
结果:
An object of class Seurat
1 features across 36 samples within 1 assay
Active assay: RNA (1 features, 0 variable features)
2 layers present: counts, data
1 dimensional reduction calculated: tsne
scanpy:
#选取线粒体比率>0.02的子集
sub_adata=adata[adata.obs.percent_mito>0.02]
#查看子集
print(sub_adata)
结果:
View of AnnData object with n_obs × n_vars = 188 × 765
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
8.查看counts数据:
seurat:
print(pbmc_small@assays$RNA@counts)
#seurat V5版本操作: print(seurat_obj@assays$RNA@layers$counts)
结果:
230 x 80 sparse Matrix of class "dgCMatrix"
[[ suppressing 38 column names ‘ATGCCAGAACGACT’, ‘CATGGCCTGTGCAT’, ‘GAACCTGATGAACC’ ... ]]
[[ suppressing 38 column names ‘ATGCCAGAACGACT’, ‘CATGGCCTGTGCAT’, ‘GAACCTGATGAACC’ ... ]]
MS4A1 . . . . . . . . . . 2 2 4 4 2 3 3 4 2 3 . . . . 1 . . . . . . . . . . . . .
CD79B 1 . . . . . . . . 1 2 4 3 3 2 3 1 2 2 5 . . . . . . . . . 1 1 . 2 . . . . .
CD79A . . . . . . . . . . . 5 2 2 5 8 1 5 5 12 . . 1 . . . . 1 . . . . . . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
SPON2 . 1 . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . 3 5 1 3 . . 1 2 ......
S100B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 . . 1
scanpy:
print(adata.X)
结果:
[[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, 4.011],
[ 1.171, -0.191, 0.795, ..., -0.21 , 2.63 , -0.49 ],
[-0.326, -0.191, 0.483, ..., -0.21 , 0.663, -0.49 ],
...,
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, 1.226],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
[-0.326, -0.191, 0.148, ..., -0.21 , -0.636, -0.49 ]]
9.查看数据对象的形状:
seurat:
print(dim(pbmc_small))
结果:
[1] 230 80
scanpy:
print(adata.shape)
结果:
(700, 765)
10.获取指定条件的barcodes:
seurat:
#获取counts值大于500的barcodes
print(WhichCells(pbmc_small,expression=nCount_RNA>500))
结果:
[1] "ATTGCACTTGCTTT" "TTGAGGACTACGCA" "ATACCACTCTAAGC" "GACATTCTCCACCT" "ACGTGATGCCATGA"
[6] "ATTGTAGATTCCCG" "GCGTAAACACGGTT" "GAGTTGTGGTAGCT"
scanpy:
i=adata.obs.n_counts>7000
print(adata.obs_names[i])
结果:
['CTTAGACTTATTCC-1', 'TGTTAAGAAGCGGA-1', 'AGGCTAACATAAGG-3','CTTCACCTGTCTAG-4']