The scSorter package implements the semi-supervised cell type assignment algorithm described in "scSorter: assigning cells to known cell types according to known marker genes". This algorithm assigns cells to known cell types, assuming that the identities of marker genes are given but the exact expression levels of marker genes are unavailable. This vignette will illustrate the steps to apply the algorithm to single-cell RNA sequencing (scRNA-seq) data.
scSorter takes as input data the expression matrix from single-cell RNA sequencing and the annotation file that specifies the names of marker genes for each cell type of interest. scSorter assumes the input expression data have been properly normalized for the library size and have been properly transformed (e.g. log-transformation) to stabilize the variance.
To start the analysis, load scSorter package:
library(scSorter)
In this vignette, we will use the data created by the Tabula Muris Consortium (Tabula Muris Consortium and others 2018) to illustrate the procedure for scSorter to assign cells to known cell types. We follow the analysis presented in our paper and focus on pancreas tissue from which 1,564 cells with valid cell type annotation are available. The cells are from: pancreatic A (390 cells), pancreatic B (449 cells), pancreatic D (140 cells), pancreatic PP (73 cells), pancreatic acinar (182 cells), pancreatic ductal (161 cells), pancreatic stellate (49 cells), endothelial (66 cells) and immune (54 cells). Marker genes for these cell types are extracted from the original study.
We first load the data to ensure they are in the correct formats.
load(url('https://github.com/hyguo2/scSorter/blob/master/inst/extdata/TMpancreas.RData?raw=true'))
The expression matrix expr should represent genes by rows and cells by columns.
expr[1:5, 1:5]
#> A1.MAA000884.3_10_M.1.1 A1.MAA001862.3_39_F.1.1
#> 0610005C13Rik 0 0
#> 0610007C21Rik 168 162
#> 0610007L01Rik 1 2
#> 0610007N19Rik 0 0
#> 0610007P08Rik 0 0
#> A10.MAA000884.3_10_M.1.1 A10.MAA001857.3_38_F.1.1
#> 0610005C13Rik 0 0
#> 0610007C21Rik 121 18
#> 0610007L01Rik 44 170
#> 0610007N19Rik 7 22
#> 0610007P08Rik 2 0
#> A11.MAA000577.3_8_M.1.1
#> 0610005C13Rik 0
#> 0610007C21Rik 373
#> 0610007L01Rik 0
#> 0610007N19Rik 0
#> 0610007P08Rik 0
dim(expr)
#> [1] 23433 1564
Next, we check the annotation file.
head(anno)
#> Type Marker Weight
#> 1 Endothelial_cells Pecam1 2
#> 2 Endothelial_cells Cdh5 2
#> 3 Endothelial_cells Kdr 2
#> 4 Immune Ptprc 2
#> 5 Pancreatic_A_cells Gcg 2
#> 6 Pancreatic_A_cells Mafb 2
The annotation file could be stored in a matrix or a data frame. Two columns, "Type" and "Marker", are mandatory as they record the names of marker genes for each cell type of interest and will guide scSorter to assign cells to each cell type. The third column "Weight" is optional. As explained in our paper, weights could be assigned to each marker gene to represent their relative importance during the cell type assignment. A larger weight reflects more confidence of a gene being a marker gene of the corresponding cell type. If such knowledge is not available, we recommend simply choose a constant value for all marker genes by setting the default_weight option in the scSorter main function and the "Weight" column could therefore be omitted.
Now that the formats of these data are correct, we could proceed to the next step.
scSorter does not need all genes from the original expression matrix to conduct cell type assignment. Instead, it uses only marker genes and a selected group of highly variable genes other than marker genes to conduct the analysis. scSorter also requires the input expression data to be properly normalized for the library size and properly transformed to stabilize the variance. Here is a simple example of how this can be done:
Choose highly variable genes.
topgenes = xfindvariable_genes(expr, ngenes = 2000)
Normalize the input data.
expr = xnormalize_scData(expr)
We also filter out genes with non-zero expression in less than 10% of total cells.
topgene_filter = rowSums(as.matrix(expr)[topgenes, ]!=0) > ncol(expr)*.1
topgenes = topgenes[topgene_filter]
At last, we subset the preprocessed expression data. Now, we are ready to run scSorter.
picked_genes = unique(c(anno$Marker, topgenes))
expr = expr[rownames(expr) %in% picked_genes, ]
The above code is only for showing the concept. In practice, there are plenty of packages that could preprocess the data. Here, we use the package Seurat to illustrate this step (please remove the # mark in the code when you run it).
First, we normalize and transform the data with NormalizeData() function.
#library(Seurat)
#expr_obj = CreateSeuratObject(expr)
#expr_obj <- NormalizeData(expr_obj, normalization.method = "LogNormalize", scale.factor = 10000, verbose = F)
Then, we choose highly variable genes by FindVariableFeatures() function. We also filter out genes with non-zero expression in less than 10% of total cells.
#expr_obj <- FindVariableFeatures(expr_obj, selection.method = "vst", nfeatures = 2000, verbose = F)
#topgenes <- head(VariableFeatures(expr_obj), 2000)
#expr = GetAssayData(expr_obj)
#topgene_filter = rowSums(as.matrix(expr)[topgenes, ]!=0) > ncol(expr)*.1
#topgenes = topgenes[topgene_filter]
At last, we subset the preprocessed expression data. Now, we are ready to run scSorter.
#picked_genes = unique(c(anno$Marker, topgenes))
#expr = expr[rownames(expr) %in% picked_genes, ]
The scSorter function requires the preprocessed data as input. As mentioned in Section 1, if the "Weight" column is omitted in the annotation file, a single weight should be specified for marker genes by using the default_weight option. Otherwise, the algorithm will use the weights from the annotation file to conduct the analysis.
rts <- scSorter(expr, anno)
The cell type assignment results are stored in the Pred_Type vector.
print(table(rts$Pred_Type))
#>
#> Endothelial_cells Immune Pancreatic_A_cells
#> 63 41 416
#> Pancreatic_Acinar_cells Pancreatic_B_cells Pancreatic_D_cells
#> 185 457 116
#> Pancreatic_Ductal_cells Pancreatic_PP_cells Pancreatic_Stellate_cells
#> 162 73 33
#> Unknown
#> 18
The misclassification rate is:
mis_rate = 1 - mean(rts$Pred_Type == true_type)
round(mis_rate, 4)
#> [1] 0.055
The confusion matrix is:
table(true_type, rts$Pred_Type)
#>
#> true_type Endothelial_cells Immune Pancreatic_A_cells
#> Endothelial_cells 63 0 0
#> Immune 0 41 6
#> Pancreatic_A_cells 0 0 384
#> Pancreatic_Acinar_cells 0 0 0
#> Pancreatic_B_cells 0 0 0
#> Pancreatic_D_cells 0 0 16
#> Pancreatic_Ductal_cells 0 0 0
#> Pancreatic_PP_cells 0 0 9
#> Pancreatic_Stellate_cells 0 0 1
#>
#> true_type Pancreatic_Acinar_cells Pancreatic_B_cells
#> Endothelial_cells 2 1
#> Immune 3 0
#> Pancreatic_A_cells 0 1
#> Pancreatic_Acinar_cells 176 2
#> Pancreatic_B_cells 0 446
#> Pancreatic_D_cells 0 6
#> Pancreatic_Ductal_cells 1 0
#> Pancreatic_PP_cells 0 1
#> Pancreatic_Stellate_cells 3 0
#>
#> true_type Pancreatic_D_cells Pancreatic_Ductal_cells
#> Endothelial_cells 0 0
#> Immune 0 0
#> Pancreatic_A_cells 0 0
#> Pancreatic_Acinar_cells 0 0
#> Pancreatic_B_cells 0 0
#> Pancreatic_D_cells 115 0
#> Pancreatic_Ductal_cells 1 159
#> Pancreatic_PP_cells 0 0
#> Pancreatic_Stellate_cells 0 3
#>
#> true_type Pancreatic_PP_cells Pancreatic_Stellate_cells
#> Endothelial_cells 0 0
#> Immune 0 0
#> Pancreatic_A_cells 5 0
#> Pancreatic_Acinar_cells 1 0
#> Pancreatic_B_cells 3 0
#> Pancreatic_D_cells 3 0
#> Pancreatic_Ductal_cells 0 0
#> Pancreatic_PP_cells 61 0
#> Pancreatic_Stellate_cells 0 33
#>
#> true_type Unknown
#> Endothelial_cells 0
#> Immune 4
#> Pancreatic_A_cells 0
#> Pancreatic_Acinar_cells 3
#> Pancreatic_B_cells 0
#> Pancreatic_D_cells 0
#> Pancreatic_Ductal_cells 0
#> Pancreatic_PP_cells 2
#> Pancreatic_Stellate_cells 9
Tabula Muris Consortium and others. 2018. “Single-Cell Transcriptomics of 20 Mouse Organs Creates a Tabula Muris.” Nature 562 (October). Genome Research: 367–72. doi:10.1038/s41586-018-0590-4.