This tutorial is based off of the “Orchestrating Single-Cell Analysis with Bioconductor” e-book, a comprehensive resource designed to guide users through the process of analyzing single-cell RNA sequencing (scRNA-seq) data using the Bioconductor ecosystem in R. Links to more detailed explanations of the material are included in the tutorial.
In this tutorial you will learn to:
Kaslisto-Bustools
or procressed counts downloaded from
GEO.DropletUtils
.module purge
module load R-bundle-CRAN/2023.12-foss-2023a
R --vanilla
library(tidyverse)
library(SingleCellExperiment)
library(scater)
library(scran)
library(DropletUtils)
library(Matrix)
library(patchwork)
library(PCAtools)
library(bluster)
library(scDblFinder)
library(AUCell)
library(pheatmap)
library(RColorBrewer)
library(readxl)
data_dir <- "/mnt/research/bioinformaticsCore/projects/petroffm/BCC105_scRNAseq_training/data"
results_dir <- "/mnt/research/bioinformaticsCore/projects/petroffm/BCC105_scRNAseq_training/results/Lukassen_testes"
# if results_dir doesn't exist, make it
if(!dir.exists(results_dir)){dir.create(results_dir)}
# Read in the count matrix that was output by `kb`.
mat <- readMM(paste0(data_dir, "/Lukassen_testes/SRR6129050_mouse1/counts_unfiltered/cells_x_genes.mtx"))
dim(mat)
## [1] 681407 34183
head(mat[,1:5])
## 6 x 5 sparse Matrix of class "dgTMatrix"
##
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . . . .
## [5,] . . . . .
## [6,] . . . . .
# read in the gene names
gene_names = read.delim(paste0(data_dir,"/Lukassen_testes/SRR6129050_mouse1/counts_unfiltered/cells_x_genes.genes.names.txt"), header=F)
dim(gene_names)
## [1] 34183 1
head(gene_names)
## V1
## 1 Gm15178
## 2 Gm29155
## 3 C730036E19Rik
## 4 Gm29157
## 5 Gm29156
## 6 Pcmtd1
# add genes as column names
colnames(mat) <- gene_names$V1
head(mat[,1:5])
## 6 x 5 sparse Matrix of class "dgTMatrix"
## Gm15178 Gm29155 C730036E19Rik Gm29157 Gm29156
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . . . .
## [5,] . . . . .
## [6,] . . . . .
# read in the cell barcodes
barcodes <- read.delim(paste0(data_dir,"/Lukassen_testes/SRR6129050_mouse1/counts_unfiltered/cells_x_genes.barcodes.txt"), header=F)
dim(barcodes)
## [1] 681407 1
head(barcodes)
## V1
## 1 AAACCTGAGAAACCAT
## 2 AAACCTGAGAAACCGC
## 3 AAACCTGAGAAACCTA
## 4 AAACCTGAGAAACGAG
## 5 AAACCTGAGAAACGCC
## 6 AAACCTGAGAAAGTGG
# add cell barcodes as row names
rownames(mat) <- barcodes$V1
head(mat[,1:5])
## 6 x 5 sparse Matrix of class "dgTMatrix"
## Gm15178 Gm29155 C730036E19Rik Gm29157 Gm29156
## AAACCTGAGAAACCAT . . . . .
## AAACCTGAGAAACCGC . . . . .
## AAACCTGAGAAACCTA . . . . .
## AAACCTGAGAAACGAG . . . . .
## AAACCTGAGAAACGCC . . . . .
## AAACCTGAGAAAGTGG . . . . .
# get the total counts per barcode
rs <- rowSums(mat)
# remove barcodes with no detected counts from the matrix
filt_mat <- mat[rs>0,]
# get total number of cells each gene is expressed in
cells_exp = rowSums(t(filt_mat)!=0)
filt_mat <- filt_mat[,cells_exp>0]
SingleCellExperiment
objectA SingleCellExperiment object is a data structure in R, specifically designed for storing and managing single-cell RNA sequencing (scRNA-seq) data.
Data Storage: The SingleCellExperiment object holds the main data matrix, typically containing gene expression counts, where rows represent genes and columns represent cells.
Metadata: It includes slots for storing metadata related to both the rows (genes/features) and columns (cells/samples). This metadata can include information such as gene annotations or cell type labels.
Assays: The object supports multiple “assays,” allowing you to store different versions of the data (e.g., raw counts, normalized counts, or log-transformed data) within the same object.
Reduced Dimensions: It can store the results of dimensionality reduction techniques like PCA, t-SNE, or UMAP, making it easier to manage and visualize lower-dimensional representations of the data.
Cell and Feature-Level Data: The colData and rowData slots store additional cell-level and gene-level information, respectively, such as cell quality metrics, clustering results, or gene annotations.
Interoperability: SingleCellExperiment objects are designed to be compatible with a wide range of Bioconductor packages, facilitating various downstream analyses, including clustering, differential expression analysis, and trajectory inference.
Right now, we have a matrix where barcodes are rows and genes are
columns. SingleCellExperiment()
expects genes to be rows
and barcodes to be columns, so we transpose the matrix with the
t()
function.
sce <- SingleCellExperiment(assays = list(counts = t(filt_mat)))
To access the count data we just supplied, we can do any one of the following:
DropletUtils
For droplet-based data only (ie 10x and drop-seq)
A useful diagnostic for droplet-based data is the barcode rank plot, which shows the (log-)total UMI count for each barcode on the y-axis and the (log-)rank on the x-axis. This is effectively a transposed empirical cumulative density plot with log-transformed axes. It is useful as it allows users to examine the distribution of total counts across barcodes, focusing on those with the largest counts.
br.out <- barcodeRanks(counts(sce))
head(br.out)
## DataFrame with 6 rows and 3 columns
## rank total fitted
## <numeric> <integer> <numeric>
## AAACCTGAGAAACCAT 453160 3 NA
## AAACCTGAGAAACCGC 537144 2 NA
## AAACCTGAGAAACCTA 279308 6 NA
## AAACCTGAGAAACGAG 629267 1 NA
## AAACCTGAGAAACGCC 453160 3 NA
## AAACCTGAGAAAGTGG 279308 6 NA
# make the rownames a column
br.out$barcodes <- rownames(br.out)
head(br.out)
## DataFrame with 6 rows and 4 columns
## rank total fitted barcodes
## <numeric> <integer> <numeric> <character>
## AAACCTGAGAAACCAT 453160 3 NA AAACCTGAGAAACCAT
## AAACCTGAGAAACCGC 537144 2 NA AAACCTGAGAAACCGC
## AAACCTGAGAAACCTA 279308 6 NA AAACCTGAGAAACCTA
## AAACCTGAGAAACGAG 629267 1 NA AAACCTGAGAAACGAG
## AAACCTGAGAAACGCC 453160 3 NA AAACCTGAGAAACGCC
## AAACCTGAGAAAGTGG 279308 6 NA AAACCTGAGAAAGTGG
# Making the knee plot.
ggplot(as.data.frame(br.out), aes(x=rank, y=total)) +
geom_point() +
geom_hline(aes(yintercept=metadata(br.out)$knee,
linetype="knee"),
color = "dodgerblue") +
geom_hline(aes(yintercept=metadata(br.out)$inflection,
linetype="inflection"),
color = "forestgreen") +
scale_linetype_manual(name = "legend",
values = c(2, 2),
guide = guide_legend(override.aes = list(color = c("forestgreen", "dodgerblue")))) +
scale_x_continuous(trans='log10') +
scale_y_continuous(trans='log10') +
xlab("Rank") +
ylab("Total Counts")
ggsave(file = paste0(results_dir, "/kneeplot.png"))
Empty droplets often contain RNA from the ambient solution, resulting
in non-zero counts after debarcoding. The emptyDrops
function is designed to distinguish between empty droplets and cells. It
does so by testing each barcode’s expression profile for significant
deviation from the ambient profile.
# Set a seed. This can be any number. Using the same number every time ensures that a sequence of random numbers generated is the same every time you run your code.
set.seed(100)
# set `by.rank` to the max number of cells you expect
e.out <- emptyDrops(counts(sce),
by.rank = 2000)
e.out
## DataFrame with 676046 rows and 5 columns
## Total LogProb PValue Limited FDR
## <integer> <numeric> <numeric> <logical> <numeric>
## AAACCTGAGAAACCAT 3 NA NA NA NA
## AAACCTGAGAAACCGC 2 NA NA NA NA
## AAACCTGAGAAACCTA 6 NA NA NA NA
## AAACCTGAGAAACGAG 1 NA NA NA NA
## AAACCTGAGAAACGCC 3 NA NA NA NA
## ... ... ... ... ... ...
## TTTGTCATCTTTACAC 1 NA NA NA NA
## TTTGTCATCTTTACGT 8 NA NA NA NA
## TTTGTCATCTTTAGGG 734 NA NA NA NA
## TTTGTCATCTTTAGTC 7 NA NA NA NA
## TTTGTCATCTTTCCTC 21 NA NA NA NA
# add barcodes to a column
e.out$barcode = rownames(e.out)
# make a vector of significant barcodes
# use a false discovery rate (FDR) of 0.1%,
# meaning that no more than 0.1% of our called
# barcodes should be empty droplets on average.
sig_barcodes <-
as_tibble(e.out) %>%
filter(FDR < .001) %>%
pull(barcode)
# how many cells?
length(sig_barcodes)
## [1] 1537
This is close to the number of cells used in the paper. Filter the sce for these cells.
sce <- sce[,sig_barcodes]
Common metrics include for identifying low quality cells include:
For each cell, we calculate these QC metrics using the
perCellQCMetrics()
function from the scater
package.
# Identifying the mitochondrial transcripts in our SingleCellExperiment.
# mouse mito genes start with "mt"
# human is "MT"
# change for different species
# this is a numeric vector of row numbers
# corresponding to mito genes
is.mito <- grep("^mt", rownames(sce))
# show the gene names
rownames(sce)[is.mito]
## [1] "mt-Nd1" "mt-Nd2" "mt-Co1" "mt-Co2" "mt-Atp8" "mt-Atp6" "mt-Co3"
## [8] "mt-Nd3" "mt-Nd4l" "mt-Nd4" "mt-Nd5" "mt-Nd6" "mt-Cytb"
# get the per cell stats
stats <- perCellQCMetrics(sce,
subsets=list(Mito=is.mito))
head(stats)
## DataFrame with 6 rows and 6 columns
## sum detected subsets_Mito_sum subsets_Mito_detected
## <numeric> <numeric> <numeric> <numeric>
## AAACCTGAGCTTATCG 24918 5292 18 5
## AAACCTGGTTGAGTTC 20332 4995 18 6
## AAACCTGTCAACGAAA 24765 5240 5 3
## AAACCTGTCTCATTCA 7808 3083 557 13
## AAACGGGCACAGGTTT 30331 5766 9 6
## AAACGGGCACTCAGGC 8331 3295 746 11
## subsets_Mito_percent total
## <numeric> <numeric>
## AAACCTGAGCTTATCG 0.0722369 24918
## AAACCTGGTTGAGTTC 0.0885304 20332
## AAACCTGTCAACGAAA 0.0201898 24765
## AAACCTGTCTCATTCA 7.1337090 7808
## AAACGGGCACAGGTTT 0.0296726 30331
## AAACGGGCACTCAGGC 8.9545073 8331
‘sum’ is the the sum of counts for each cell. ‘detected’ is the number of genes with at least one count in each cell. ‘subsets_Mito_percent’ is the percent of the total counts (‘sum’) that come from mitochrondria genes: ‘subsets_Mito_sum’.
summary(stats$sum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3760 6236 16705 19566 23319 134159
summary(stats$detected)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1144 2708 4324 4126 5098 8944
summary(stats$subsets_Mito_percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1065 0.6323 2.0879 1.4527 34.8936
If paper that describes the data outlines the thresholds used to quality filter the cells, you can use those fixed thresholds to filter. For example:
# these aren't actually from the paper
# I made them up as an example
qc.counts <- stats$sum < 5000
qc.nexprs <- stats$detected < 2000
qc.mito <- stats$subsets_Mito_percent > 10
discard <- qc.counts | qc.nexprs | qc.mito
# Summarize the number of cells removed for each reason.
DataFrame(LibSize=sum(qc.counts), NExprs=sum(qc.nexprs), MitoProp=sum(qc.mito), Total=sum(discard))
## DataFrame with 1 row and 4 columns
## LibSize NExprs MitoProp Total
## <integer> <integer> <integer> <integer>
## 1 267 55 85 317
# remove cells that don't meet the thresholds
fixed_sce = sce[,!discard]
fixed_sce
## class: SingleCellExperiment
## dim: 27058 1220
## metadata(0):
## assays(1): counts
## rownames(27058): Gm29155 Pcmtd1 ... ENSMUSG00000095041.8
## ENSMUSG00000095742.2
## rowData names(0):
## colnames(1220): AAACCTGAGCTTATCG AAACCTGGTTGAGTTC ... TTTGTCAGTAATAGCA
## TTTGTCAGTCTCGTTC
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
If the hard thresholds used by the authors are not available, you can
use a relaxed QC strategy and only remove cells with large mitochondrial
proportions, using it as a proxy for cell damage. This reduces the risk
of removing cell types with low RNA content, especially in a
heterogeneous populations with many different cell types. More
information about the isOutlier()
function can be found here
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
# how many high mito cells?
sum(high.mito)
## [1] 292
# What are the cuttoffs?
attr(high.mito, "thresholds")
## lower higher
## -Inf 3.025697
# add the stats information to the sce object as col data
colData(sce) <- cbind(colData(sce), stats)
# add a column indicating the cells you want to discard
# in this case high.mito cells
sce$discard <- high.mito
# plot the distribution of the stats and indicate which cells will be discarded
plot.count =
plotColData(sce,
y="sum",
colour_by="discard") +
scale_y_log10() +
ggtitle("Total count")
plot.genes =
plotColData(sce,
y="detected",
colour_by="discard") +
scale_y_log10() +
ggtitle("Detected genes")
plot.mito =
plotColData(sce,
y="subsets_Mito_percent",
colour_by="discard") +
ggtitle("Mito percent")
plot.all = plot.count | plot.genes | plot.mito
plot.all
ggsave(plot.all,
height = 4,
file = paste0(results_dir, "/qc_plots.png"))
The high mito cells tend to have lower total UMI counts and lower number of detected genes.
Remove the high.mito cells from sce.
sce <- sce[,!high.mito]
sce
## class: SingleCellExperiment
## dim: 27058 1245
## metadata(0):
## assays(1): counts
## rownames(27058): Gm29155 Pcmtd1 ... ENSMUSG00000095041.8
## ENSMUSG00000095742.2
## rowData names(0):
## colnames(1245): AAACCTGAGCTTATCG AAACCTGGTTGAGTTC ... TTTGTCAGTAATAGCA
## TTTGTCAGTCTCGTTC
## colData names(7): sum detected ... total discard
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
To accurately account for differences in library size (total count)
between cells, we will first perform a quick cluster analysis to group
similar cells together and calculate cell-specific scaling factors
(often called size factors) using computeSumFactors()
. See
the detailed explanation from OSCA here
Once we have computed the size factors, we use the
logNormCounts()
function to compute normalized expression
values for each cell. This is done by dividing the count for each gene
with the appropriate size factor for that cell. The function also
log-transforms the normalized values, creating a new assay called
“logcounts”. See the detailed explanation from OSCA here
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)
For cell clustering we need to select genes that contain useful
information about the biology of the system while removing genes that
contain random noise. The simplest approach to feature selection is to
select the most variable genes based on their expression across cells.
We’ll use getTopHVGs()
to detect these genes.
The number of highly variable genes to consider is fairly arbitrary, with any value from 500 to 5000 considered “reasonable”. n=1000 is a common choice.
# store the highly variable genes within the single cell object
rowSubset(sce, "HVGs") <- getTopHVGs(sce, n=1000)
In single-cell RNA-seq analysis, we have a lot of genes being measured across thousands of cells. This creates a massive amount of data, with each cell having a unique expression pattern for thousands of genes. Handling and interpreting such large datasets can be challenging.
Principal Component Analysis (PCA) helps by reducing the complexity of the data. Instead of looking at thousands of genes, PCA combines them into a smaller number of “principal components.” Each principal component is a combination of gene expression patterns that captures the most variation in the data. We’ll use the PCs to define and visualize our cell clusters later.
We’ll run PCA on the expression of the highly variable genes.
sce <- runPCA(sce, subset_row=rowSubset(sce, "HVGs"))
sce
## class: SingleCellExperiment
## dim: 27058 1245
## metadata(0):
## assays(2): counts logcounts
## rownames(27058): Gm29155 Pcmtd1 ... ENSMUSG00000095041.8
## ENSMUSG00000095742.2
## rowData names(1): HVGs
## colnames(1245): AAACCTGAGCTTATCG AAACCTGGTTGAGTTC ... TTTGTCAGTAATAGCA
## TTTGTCAGTCTCGTTC
## colData names(8): sum detected ... discard sizeFactor
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):
runPCA()
reports the first 50 PCs by default, but how
many do you actually need for downstream analysis? A simple heuristic
for choosing the suitable number of PCs involves identifying the elbow
point in the percentage of variance explained by successive PCs. This
refers to the “elbow” in the curve of a scree plot. See OSCA
advanced: Chapter 4 for more details.
We can plot a scree plot and identify the “elbow” using
findElbowPoint()
.
# Percentage of variance explained is tucked away in the attributes.
percent.var <- attr(reducedDim(sce), "percentVar")
chosen.elbow <- findElbowPoint(percent.var)
chosen.elbow
## [1] 4
# make the scree plot
plot(percent.var, xlab="PC", ylab="Variance explained (%)")
abline(v=chosen.elbow, col="red")
# save as a pdf
pdf(file = paste0(results_dir, "/scree.pdf"))
plot(percent.var, xlab="PC", ylab="Variance explained (%)")
abline(v = chosen.elbow, col="red")
dev.off()
## png
## 2
# store only the relevant PCs in the sce
# with the name "PCA.elbow"
reducedDim(sce, "PCA.elbow") <- reducedDim(sce, "PCA")[,1:chosen.elbow]
How do you find the “best” clusters within the single-cell data? I like this answer from the overview of the OSCA clustering chapter:
A more relevant question is “how well do the clusters approximate the cell types or states of interest?” Unfortunately, this is difficult to answer given the context-dependent interpretation of the underlying biology. Some analysts will be satisfied with resolution of the major cell types; other analysts may want resolution of subtypes; and others still may require resolution of different states (e.g., metabolic activity, stress) within those subtypes. Moreover, two clusterings can be highly inconsistent yet both valid, simply partitioning the cells based on different aspects of biology. Indeed, asking for an unqualified “best” clustering is akin to asking for the best magnification on a microscope without any context.
So, there’s no “one-size fits all” solution for finding the best clusters.
One basic way to alter clustering is by changing the value \(k\), or the number of nearest neighbors of
each cell to consider when building clusters.This controls the
resolution of the clustering where higher \(k\) yields broader clusters. You can
exploit this by experimenting with different values of \(k\) to obtain a satisfactory resolution.
clusterCells()
uses k=10
as a default and is a
good place to start.
As mentioned above, we’ll cluster the cells in PC space rather than gene expression space for efficiency. We specify that the dimensionality reduction (“dimred”) we want to use is the PCs we defined using the elbow plot.
clust.10 <- clusterCells(sce, use.dimred="PCA.elbow")
table(clust.10)
## clust.10
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 82 101 49 117 102 95 82 51 89 56 109 96 35 82 41 58
k=10
gives us 16 clusters. Let’s try k=40
and k=20
for comparison.
clust.40 <- clusterCells(sce,
use.dimred="PCA.elbow",
BLUSPARAM=NNGraphParam(k=40))
table(clust.40)
## clust.40
## 1 2 3 4 5 6 7
## 200 242 323 120 147 112 101
clust.20 <- clusterCells(sce,
use.dimred="PCA.elbow",
BLUSPARAM=NNGraphParam(k=20))
table(clust.20)
## clust.20
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 88 181 112 116 88 106 151 92 72 53 65 57 64
Add the cluster information to the colData field of the sce object.
sce$clust.10 <- clust.10
sce$clust.40 <- clust.40
sce$clust.20 <- clust.20
It can also be useful to plot our QC metrics by cluster to see if
cells cluster based on technical rather than biological factors. We can
do this with plotColData()
. I’ll use clust.20 as an
example.
# plot
plot.count <-
plotColData(sce,
y="sum",
x = "clust.20",
colour_by="clust.20") +
scale_y_log10() +
ggtitle("Total count")
plot.genes <-
plotColData(sce,
y="detected",
x = "clust.20",
colour_by="clust.20") +
scale_y_log10() +
ggtitle("Detected genes")
plot.mito <-
plotColData(sce,
y="subsets_Mito_percent",
x = "clust.20",
colour_by="clust.20") +
ggtitle("Mito percent")
plot.all <- plot.count / plot.genes / plot.mito
plot.all
ggsave(plot.all,
file = paste0(results_dir, "/qc_plots_by_clust20.png"))
Single-cell RNA-seq data is high-dimensional, with thousands of genes measured across thousands to millions of cells. Uniform manifold approximation and projection (UMAP) and \(t\)-stochastic neighbor embedding (t-SNE) effectively reduce this complexity to 2D space, making it easier to visualize while preserving some of the underlying biological structure.
It is arguable whether the UMAP or
t-SNE visualizations are more useful or aesthetically pleasing. UMAP
aims to preserve more global structure but this necessarily reduces
resolution within each visual cluster. However, UMAP is unarguably much
faster, and for that reason alone, it is increasingly displacing t-SNE
as the method of choice for visualizing large scRNA-seq data sets.
Important
When interpreting UMAP and t-SNE plots, it’s safest to focus on local neighborhoods, as these methods prioritizes preserving the relationships between nearest neighbors from high-dimensional space in the 2D embedding. If multiple cell types or clusters appear close together in the plot, they were likely also neighbors in higher-dimensional space. However, distances between non-neighboring cells may not be accurately represented, so using the spacing between clusters to infer similarity between distinct cell types is unreliable.
We’ll use PC space rather than gene expression space to find the UMAP and t-SNE coordinates for efficiency. We specify that the dimensionality reduction (“dimred”) we want to use is the PCs we defined using the elbow plot.
set.seed(100)
# runUMAP
sce <- runUMAP(sce, dimred="PCA.elbow")
# runTSNE
sce <- runTSNE(sce, dimred="PCA.elbow")
sce
## class: SingleCellExperiment
## dim: 27058 1245
## metadata(0):
## assays(2): counts logcounts
## rownames(27058): Gm29155 Pcmtd1 ... ENSMUSG00000095041.8
## ENSMUSG00000095742.2
## rowData names(1): HVGs
## colnames(1245): AAACCTGAGCTTATCG AAACCTGGTTGAGTTC ... TTTGTCAGTAATAGCA
## TTTGTCAGTCTCGTTC
## colData names(11): sum detected ... clust.40 clust.20
## reducedDimNames(4): PCA PCA.elbow UMAP TSNE
## mainExpName: NULL
## altExpNames(0):
Plot the clusters colored by clust.10, clust.20 and clust.40 in UMAP space.
plot10 <- plotReducedDim(sce, dimred="UMAP", colour_by="clust.10")
plot20 <- plotReducedDim(sce, dimred="UMAP", colour_by="clust.20")
plot40 <- plotReducedDim(sce, dimred="UMAP", colour_by="clust.40")
plot.all <- plot10 | plot20 | plot40
plot.all
ggsave(plot.all, file = paste0(results_dir, "/UMAP_ktest.png"), width = 12, height = 4)
Plot the clusters colored by clust.10, clust.20 and clust.40 in TSNE space.
plot10 <- plotReducedDim(sce, dimred="TSNE", colour_by="clust.10")
plot20 <- plotReducedDim(sce, dimred="TSNE", colour_by="clust.20")
plot40 <- plotReducedDim(sce, dimred="TSNE", colour_by="clust.40")
plot.all <- plot10 | plot20 | plot40
plot.all
ggsave(plot.all, file = paste0(results_dir, "/TSNE_ktest.png"), width = 12, height = 4)
“Doublets” are barcodes that contain the transcriptomes of two cells instead of just one.
We’ll use scDblFinder()
from the scDblFinder
package to identify doublets for removal. scDblFinder()
simulates thousands of doublets by combining the expression profiles of
real cells in the data set. It then reports a “doublet score” for each
cell that represents how similar the cell is to the simulated doublets
vs real cells. Higher scoring cells are more likely to be doublets. The
function also classifies the cells based on the scores and other
criteria into “doublets” and “singlets”.
sce.dbl <- scDblFinder(sce)
# how many cells are classified as doublets?
table(sce.dbl$scDblFinder.class)
##
## singlet doublet
## 1194 51
# plot the doublet score in tsne space
tsne.dbl <- plotTSNE(sce.dbl, colour_by="scDblFinder.score")
# plot the clusters in tsne space for reference
tsne.20 <- plotTSNE(sce.dbl, colour_by="clust.20")
# make a vln plot of the doublet score by cluster
vln.dbl <-
plotColData(sce.dbl,
y="scDblFinder.score",
x = "clust.20",
colour_by="scDblFinder.class")
# combine the plots with patchwork
plot.all <- (tsne.dbl | tsne.20) / vln.dbl + plot_layout(heights = c(2, 1))
plot.all
ggsave(plot.all, file = paste0(results_dir, "/TSNE_doublets.png"), width = 10, height=6)
The OSCA authors suggest that all cells from a cluster with a large average doublet score should be considered doublets. If no one cluster is over represented by doublets, as is the case here, it is safer to remove individual doublets.
# select the barcodes corresponding to singlets
singlets <-
as_tibble(colData(sce.dbl)) %>%
mutate(barcodes = rownames(colData(sce.dbl))) %>%
filter(scDblFinder.class == "singlet") %>%
pull(barcodes)
# filter the sce for singlets
sce <- sce[,singlets]
Important Any time you remove cells from the analysis, it’s important to re-normalize the data, find new PCs and recluster.
#normalize
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)
#feature selection
rowSubset(sce, "HVGs") <- getTopHVGs(sce, n=1000)
# PCA
sce <- runPCA(sce, subset_row=rowSubset(sce, "HVGs"))
percent.var <- attr(reducedDim(sce), "percentVar")
chosen.elbow <- findElbowPoint(percent.var)
reducedDim(sce, "PCA.elbow") <- reducedDim(sce, "PCA")[,1:chosen.elbow]
# cluster
clust.20 <- clusterCells(sce,
use.dimred="PCA.elbow",
BLUSPARAM=NNGraphParam(k=20))
sce$clust.20 <- clust.20
# runUMAP
sce <- runUMAP(sce, dimred="PCA.elbow")
# runTSNE
sce <- runTSNE(sce, dimred="PCA.elbow")
To interpret our clustering results, we identify the genes that drive
separation between clusters. These cluster enriched genes allow us to
assign biological meaning to each cluster based on their functional
annotation. We perform differential expression analysis between each
pair of clusters clusters the using scoreMarkers()
function
from scran. For each pairwise comparison three different effect
sizes are reported to describe the size of the gene expression
difference:
logFC.cohen
) is a standardized log-fold change where the
difference in the mean log-expression between groups is scaled by the
average standard deviation across groups. In other words, it is the
number of standard deviations that separate the means of the two groups.
The interpretation is similar to the log-fold change; positive values
indicate that the gene is upregulated in our cluster of interest,
negative values indicate downregulation and values close to zero
indicate that there is little difference.AUC
) represents
the probability that a randomly chosen observation from our cluster of
interest is greater than a randomly chosen observation from the other
cluster. A value of 1 corresponds to upregulation, where all values of
our cluster of interest are greater than any value from the other
cluster; a value of 0.5 means that there is no net difference in the
location of the distributions; and a value of 0 corresponds to
downregulation.logFC.detected
). This
ignores any information about the magnitude of expression, only
considering whether any expression is detected at all. Again, positive
values indicate that a greater proportion of cells express the gene in
our cluster of interest compared to the other cluster.If you ave \(N\) clusters, then
you’ll have \(N-1\) pairwise
comparisons for each gene for each cluster. scoreMarkers()
summarizes the effect sizes of the pairwise comparisons for each cluster
with various summary statistics (mean, median, min, max), that may be
useful in different scenarios. You can read more about them here.
# use the k=20 clusters as an example
markers20 <- scoreMarkers(sce, sce$clust.20)
The scores for all pairwise comparisons involving a particular
cluster are consolidated into a single DataFrame
for that
cluster. The scoreMarkers()
returns a list of
DataFrame
s where each DataFrame
corresponds to
a cluster and each row of the DataFrame
corresponds to a
gene. In the DataFrame
for cluster \(X\), the columns contain the
self.average
, the mean log-expression in \(X\); other.average
, the grand
mean across all other clusters; self.detected
, the
proportion of cells with detected expression in \(X\); other.detected
, the mean
detected proportion across all other clusters; and finally, the effect
size summaries generated from all pairwise comparisons involving \(X\). .
# markers20 is a list object. Each element of the list is a data frame with the summary
markers20
## List of length 12
## names(12): 1 2 3 4 5 6 7 8 9 10 11 12
colnames(markers20[["1"]]) # statistics for cluster 1.
## [1] "self.average" "other.average" "self.detected"
## [4] "other.detected" "mean.logFC.cohen" "min.logFC.cohen"
## [7] "median.logFC.cohen" "max.logFC.cohen" "rank.logFC.cohen"
## [10] "mean.AUC" "min.AUC" "median.AUC"
## [13] "max.AUC" "rank.AUC" "mean.logFC.detected"
## [16] "min.logFC.detected" "median.logFC.detected" "max.logFC.detected"
## [19] "rank.logFC.detected"
I’ve written a function, saveTopGenesDF()
, that pulls
the top n genes for each cluster given a chosen statistic into a data
frame. If write_df = TRUE it also writes a .csv
file to
resuts_dir
.
# `saveTopGeneDF()` is a function to save the top cluster enriched genes for each cluster
# marker_list is the output of scoreMarkers
# topn is the number of genes you want
# stat_oi is the stat you want to sort by
# if write_df=TRUE the df is saved to a csv file
# in results_dir
saveTopGenesDF = function(marker_list,
topn,
stat_oi,
write_df=FALSE){
top_df <- {}
for(i in 1:length(marker_list)){
if(grepl("rank", stat_oi)){ # if the stat is a rank take the min
top_clust <-
as_tibble(marker_list[[i]]) %>% # select element i from the list
mutate(Gene = rownames(marker_list[[i]]), # add a column with the genes
Cluster = paste0("cluster", i)) %>% # add a column with the cluster
slice_min(order_by = !!sym(stat_oi), # order by stat oi and take the min n
n = topn) %>%
select(Cluster, # select relevant columns
Gene,
self.average,
other.average,
self.detected,
other.detected,
!!stat_oi)
top_df <- rbind(top_df, top_clust) #bind into one df
} else { # if the stat is not a rank take the max
top_clust <-
as_tibble(marker_list[[i]]) %>% # select element i from the list
mutate(Gene = rownames(marker_list[[i]]), # add a column with the genes
Cluster = paste0("cluster", i)) %>% # add a column with the cluster
slice_max(order_by = !!sym(stat_oi), # order by stat oi and take the top n
n = topn) %>%
select(Cluster, # select relevant columns
Gene,
self.average,
other.average,
self.detected,
other.detected,
!!stat_oi)
top_df <- rbind(top_df, top_clust) #bind into one df
}
}
if(write_df==TRUE){
write.csv(top_df, file = paste0(results_dir, "/marker_genes_top", topn, "_", stat_oi, ".csv"))
}
return(top_df)
}
mean.logFC.cohen
is a reasonable default for most
applications. Here, we’ll pull the top 3 genes by
mean.logFC.cohen
for each cluster.
top_mean <- saveTopGenesDF(markers20, 3, "mean.logFC.cohen")
# show the first three clusters
as.data.frame(top_mean) %>% filter(Cluster %in% c("cluster1", "cluster2", "cluster3"))
## Cluster Gene self.average other.average self.detected
## 1 cluster1 H1f7 5.416820 2.6026708 1
## 2 cluster1 Irgc1 6.092426 4.0904163 1
## 3 cluster1 Hmgb4 6.546284 4.1063604 1
## 4 cluster2 1700001P01Rik 4.584547 1.5035519 1
## 5 cluster2 1700029E06Rik 3.131958 0.5249133 1
## 6 cluster2 Tex36 4.744866 2.1979527 1
## 7 cluster3 Mphosph8 3.066313 0.4437957 1
## 8 cluster3 Mllt10 3.760494 0.6540869 1
## 9 cluster3 D130017N08Rik 2.991809 0.3582564 1
## other.detected mean.logFC.cohen
## 1 0.7817510 3.688171
## 2 0.9653472 3.366224
## 3 0.9714688 3.353837
## 4 0.5772189 5.192025
## 5 0.2496723 5.031906
## 6 0.7120591 4.458102
## 7 0.2668179 4.950068
## 8 0.3646330 4.849644
## 9 0.2272789 4.187093
min.logFC.cohen
is more strict, and can be useful to
find genes that are specifically enriched in one cluster.
top_min <- saveTopGenesDF(markers20, 3, "min.logFC.cohen")
# show the first three clusters
as.data.frame(top_min) %>% filter(Cluster %in% c("cluster1", "cluster2", "cluster3"))
## Cluster Gene self.average other.average self.detected
## 1 cluster1 1700017M07Rik 1.741367 0.09938439 0.6838710
## 2 cluster1 Csrnp1 2.349789 0.40863359 0.7935484
## 3 cluster1 Tmem144 3.108212 0.46517059 0.8709677
## 4 cluster2 1700029E06Rik 3.131958 0.52491328 1.0000000
## 5 cluster2 4930596D02Rik 3.162792 0.75857982 0.9898990
## 6 cluster2 Tex36 4.744866 2.19795270 1.0000000
## 7 cluster3 Cks2 3.501081 0.79304798 1.0000000
## 8 cluster3 Lypd10 2.190546 0.23745206 0.9829060
## 9 cluster3 Gm21269 2.099048 0.16499472 0.9743590
## other.detected min.logFC.cohen
## 1 0.05876498 1.2316642
## 2 0.28922950 1.0033063
## 3 0.22656074 0.9774013
## 4 0.24967229 1.3122755
## 5 0.37159787 1.3069628
## 6 0.71205906 1.1382828
## 7 0.46259710 1.8619876
## 8 0.17156476 1.6595942
## 9 0.12063799 1.6513117
Plot the top genes from “mean.logFC.cohen” and “min.logFC.cohen” for cluster 1.
goi_mean <-
top_mean %>%
filter(Cluster == "cluster1") %>%
slice_max(order_by = mean.logFC.cohen,
n =3) %>%
pull(Gene)
goi_mean
## [1] "H1f7" "Irgc1" "Hmgb4"
goi_min <-
top_min %>%
filter(Cluster == "cluster1") %>%
slice_max(order_by = min.logFC.cohen,
n =3) %>%
pull(Gene)
goi_min
## [1] "1700017M07Rik" "Csrnp1" "Tmem144"
plot.top.mean <-
plotExpression(sce,
features=goi_mean,
x="clust.20", colour_by="clust.20") +
ggtitle("top mean.logFC.cohen")
plot.top.min <-
plotExpression(sce,
features=goi_min,
x="clust.20", colour_by="clust.20") +
ggtitle("top min.logFC.cohen")
plot.both <- plot.top.mean / plot.top.min
plot.both
ggsave(plot.both,
file = paste0(results_dir, "/cluster1_top_genes.png"),
height = 12, width = 10)
One common way to assign cell type labels to single cells is using
pre-defined cell-type marker genes. Specifically, we use the
AUCell package to identify marker sets that are highly
expressed in each cell. The marker genes should be formatted as a list
where each element of the list is a character vector of genes for each
cell type. I’ll make this list from the marker genes identified by
Lukassen et al. in SuppTableGenes.xlsx
.
SuppTableGenes.xlsx
is an example of a “messy” data
frame. It’s often easier to deal with data in a “tidy” format.
In Tidy data:
We’ll use functions from the dplyr
package in the Tidyverse to reformat
SuppTableGenes.xlsx
into a table with the following
columns:
Where each row represents the log fold change and p-value of a gene in a specific cell type versus all other cell types. Then, we’ll filter for positive fold changes, select the 100 genes with the smalest p-value in each cluster, and turn the data frame into a list of data frames.
# read in the cluster enriched genes from Lukassen
sup_tab <- read_excel(paste0(data_dir, "/Lukassen_testes/SuppTableGenes.xlsx"))
glimpse(sup_tab)
## Rows: 27,790
## Columns: 29
## $ EnsemblID <chr> "ENSMUSG00000079701", "ENSMUSG000000…
## $ GeneName <chr> "Ssxb3", "Ssxb1", "Ssxb2", "Gm12695"…
## $ `RS1 Average` <dbl> 0.6408870, 1.9710779, 1.8309237, 1.1…
## $ `RS1 Log2 Fold Change vs all` <dbl> 5.018199, 4.160671, 4.043102, 3.8635…
## $ `RS1 P-Value` <dbl> 6.699313e-48, 4.186727e-45, 6.198828…
## $ `SC2 Average` <dbl> 0.038339822, 0.101548718, 0.09533145…
## $ `SC2 Log2 Fold Change vs all` <dbl> -2.601868, -2.955995, -2.957196, -2.…
## $ `SC2 P-Value` <dbl> 8.069916e-11, 4.570008e-16, 4.901922…
## $ `CS Average` <dbl> 0.002912556, 0.029125559, 0.03495067…
## $ `CS Log2 Fold Change vs all` <dbl> -5.092232, -4.342928, -4.010580, -4.…
## $ `CS P-Value` <dbl> 4.391694e-08, 2.714251e-09, 1.488424…
## $ `RS2 Average` <dbl> 0.009164249, 0.241936166, 0.26942891…
## $ `RS2 Log2 Fold Change vs all` <dbl> -4.27290084, -1.42161408, -1.1571368…
## $ `RS2 P-Value` <dbl> 1.207434e-10, 3.110400e-03, 1.837556…
## $ `ES Average` <dbl> 0.003711412, 0.022268474, 0.00371141…
## $ `ES Log2 Fold Change vs all` <dbl> -4.705656, -4.611644, -6.333842, -3.…
## $ `ES P-Value` <dbl> 7.676492e-06, 1.277472e-07, 5.895288…
## $ `SC1 Average` <dbl> 0.005288957, 0.037022699, 0.02644478…
## $ `SC1 Log2 Fold Change vs all` <dbl> -4.156645, -3.869184, -4.196458, -4.…
## $ `SC1 P-Value` <dbl> 8.254880e-04, 2.035123e-04, 8.704576…
## $ `Sertoli Average` <dbl> 0.0000000, 0.4383190, 0.0000000, 0.0…
## $ `Sertoli Log2 Fold Change vs all` <dbl> 0.023696955, -0.105260671, -1.602710…
## $ `Sertoli P-Value` <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0…
## $ `Spg Average` <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0…
## $ `Spg Log2 Fold Change vs all` <dbl> 0.3243484, -1.3911729, -1.3020588, -…
## $ `Spg P-Value` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ `Leydig Average` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `Leydig Log2 Fold Change vs all` <dbl> 0.6014802, -1.1140411, -1.0249270, -…
## $ `Leydig P-Value` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
# list the 100 genes with the smallest p-val
# for each cell type
top_lukassen_df <-
sup_tab %>%
# Select columns that include "GeneName", "P-Value", and "Log2" (fold change columns)
select(GeneName, contains("P-Value"), contains("Log2")) %>%
# Remove duplicate rows based on the "GeneName" column
filter(!duplicated(GeneName)) %>%
# Convert the data frame from wide to long format
pivot_longer(cols = -GeneName, # Keep "GeneName" as is, and reshape other columns
names_to = "CellType", # Create a new column "CellType" from the names of reshaped columns
values_to = "val") %>% # Store the values in a new column "val"
# Create a new column "Stat" that categorizes values as either "Pval" or "Log2FC"
mutate(Stat = case_when(
grepl("P-Value", CellType) ~ "Pval", # If "P-Value" is in the "CellType" column, set "Stat" as "Pval"
grepl("Log2", CellType) ~ "Log2FC")) %>% # If "Log2" is in the "CellType" column, set "Stat" as "Log2FC"
# Clean up the "CellType" names by removing unnecessary strings
mutate(CellType = gsub(" P-Value", "", CellType),
CellType = gsub(" Log2 Fold Change vs all", "", CellType)) %>%
# Convert the data back from long to wide format
pivot_wider(names_from = Stat, # Use "Stat" column (which contains "Pval" or "Log2FC") as the new column names
values_from = val) %>% # Use values from "val" column as the data for the new columns
# Filter the data to keep only rows where "Log2FC" is positive
filter(Log2FC > 0) %>%
# Group the data by "CellType"
group_by(CellType) %>%
# Select the top 100 rows with the smallest "Pval" within each "CellType" group
slice_min(Pval, n = 100) %>%
# Remove the grouping structure from the data frame
ungroup() %>%
# Filter out any remaining duplicate "GeneName" entries, keeping only one occurrence
filter(!duplicated(GeneName))
top_lukassen_list <-
top_lukassen_df %>%
# Group the data by the "CellType" column
group_by(CellType) %>%
# Split the grouped data frame into a list of data frames
# Each element in the list corresponds to a unique "CellType"
split(f = as.factor(.$CellType))
# pull just the genes to use with AUCell
for_aucell <- lapply(top_lukassen_list, function(x) x %>% pull(GeneName))
Now, we’ll use AUCell_buildRankings()
to rank genes by
their expression value in each cell. Then AUCell_calcAUC
will quantify the enrichment of the top Lukassen markers in each cell.
Higher scores correspond to higher enrichment.
rankings <- AUCell_buildRankings(as.matrix(counts(sce)),
plotStats=FALSE,
verbose=FALSE)
cell.aucs <- AUCell_calcAUC(for_aucell, rankings)
Next, we’ll check how well the marker genes discriminate cell types. In heterogeneous populations, the distribution for each label should be bimodal with one high-scoring peak containing cells of that cell type and a low-scoring peak containing cells of other types. In populations where a particular cell type is expected, lack of clear bimodality for the corresponding label may indicate that its gene set is not sufficiently informative.
par(mfrow=c(3,3))
AUCell_exploreThresholds(cell.aucs, plotHist=TRUE, assign=TRUE)
# save the plot
pdf(file = paste0(results_dir, "/aucell_check.pdf"))
par(mfrow=c(3,3))
AUCell_exploreThresholds(cell.aucs, plotHist=TRUE, assign=TRUE)
dev.off()
Based on the examples
from the AUCell
vignette, these look as good as random gene
sets. But let’s see what the results look like.
Assign the label with the highest score to each cell.
results <- t(assay(cell.aucs))
# Take the marker set with the top AUC as the label for that cell.
aucell.labels <- colnames(results)[max.col(results)]
# add the labels to the sce
colData(sce) <- cbind(colData(sce), aucell.labels)
# plot the labels on the tnse plot
tsne.lab <- plotTSNE(sce, colour_by="aucell.labels")
# plot the clusters in tsne space for reference
tsne.20 <- plotTSNE(sce, colour_by="clust.20")
# combine the plots with patchwork
plot.all <- tsne.lab | tsne.20
plot.all
# Save the tsne plot as a PNG file in the specified results directory
ggsave(plot.all, file = paste0(results_dir, "/TSNE_aucell_labels.png"), width = 8, height=4)
These actually look pretty reasonable!
We can also assign cell-type labels to clusters of cells instead of
individual cells. AUCell
works best with large numbers of
markers, but you might not always have access to that information. This
is an example of using a short list of cell-type markers to label cell
clusters instead of individual cells. We’ll make a heatmap that shows
the mean.logFC.cohen
of the marker genes from figure 2 in
each cluster. The rows represent marker genes. The columns are clusters.
The genes are labeled by the cells they are expressed in according to
supplemental table 3. If a cluster has a high
mean.logFC.cohen
for a specific set of cell-type marker
genes, the cluster may represent that cell type.
# This is a df I made using info from supplemental table 3
fig2 <- read.delim(paste0(data_dir, "/Lukassen_testes/fig2_genes.txt"))
# Set row names to the values in the 'Gene.name' column
rownames(fig2) <- fig2$Gene.name
# Remove the 'Gene.name' column from the data frame since it's now used as row names
fig2$Gene.name <- NULL
# Replace all occurrences of 0 with "NO" in the data frame
fig2[fig2 == 0] <- "NO"
# Replace all occurrences of 1 with "YES" in the data frame
fig2[fig2 == 1] <- "YES"
# Define color mappings for each cell type based on the values "YES" (black) and "NO" (white)
ann_colors <- list(
Early.Sgonia = c(YES="black", NO="white"),
Late.Sgonia = c(YES="black", NO="white"),
Early.Scytes = c(YES="black", NO="white"),
Late.Scytes = c(YES="black", NO="white"),
Round.Stids = c(YES="black", NO="white"),
Later.Stids = c(YES="black", NO="white"),
Sertoli = c(YES="black", NO="white"),
Leydig = c(YES="black", NO="white")
)
# This retrieves the "mean.logFC.cohen" of
# every gene for each of our clusters
all <- saveTopGenesDF(markers20, nrow(markers20[[1]]), "mean.logFC.cohen")
# Filter the 'all' data frame to keep only the marker genes of interest that are present in 'fig2'
goi <- all %>% filter(Gene %in% rownames(fig2))
# Reshape the data frame from long to wide format
# This spreads the "mean.logFC.cohen" values across clusters for each gene
goi_wide <-
goi %>%
pivot_wider(id_cols = Gene,
names_from = Cluster,
values_from = mean.logFC.cohen)
# Convert the wide-format tibbl to a regular data frame
goi_wide <- as.data.frame(goi_wide)
# Set the row names of the data frame to the gene names
rownames(goi_wide) <- goi_wide$Gene
# Remove the 'Gene' column since it's now used as row names
goi_wide$Gene <- NULL
# Define a color palette for the heatmap using a reversed "RdBu" color scheme
color_pal <- colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(20)
# Define breaks for the color scaling in the heatmap
# The breaks are calculated to have less emphasis on the values close to zero
myBreaks <- c(seq(min(goi_wide), 0, length.out=ceiling(20/2) + 1),
seq(max(goi_wide)/20, max(goi_wide),
length.out=floor(20/2)))
# Generate the heatmap with specified settings
pheatmap(goi_wide,
color = color_pal,
border_color = NA, # No border color around heatmap cells
cluster_cols = TRUE, # Cluster columns (i.e., gene clusters)
cluster_rows = TRUE, # Cluster rows (i.e., sample clusters)
show_rownames = TRUE, # Show row names (gene names)
annotation_row = fig2, # Add annotations from the fig2 data frame
annotation_colors = ann_colors, # Apply the custom color mapping to the annotations
breaks=myBreaks) # Use the custom breaks for color scaling
# Create a PDF file to save the heatmap
pdf(file = paste0(results_dir, "/fig2_heatmap.pdf"))
pheatmap(goi_wide,
color = color_pal,
border_color = NA,
cluster_cols = TRUE,
cluster_rows = TRUE,
show_rownames = TRUE,
annotation_row = fig2,
annotation_colors = ann_colors,
breaks=myBreaks)
# Close the PDF device, saving the heatmap
dev.off()
## pdf
## 3
Based on the heatmap, these could be reasonable labels for the clusters:
Cluster | CellType |
---|---|
Cluster1 | Round Stids |
Cluster2 | Sgonia |
Cluster3 | Scytes |
Cluster4 | Scytes |
Cluster5 | Scytes |
Cluster6 | Sgonia |
Cluster7 | Sgonia |
Cluster8 | Later Stids |
Cluster9 | Round Stids |
Cluster10 | Later Stids |
Cluster11 | Late Scytes |
Cluster12 | Sgonia |
Add them to the sce
and plot.
# make a vector of the cluster labels in order of the cluster number
type_vec <- c("Round Stids", "Sgonia", "Scytes",
"Scytes", "Scytes", "Sgonia",
"Sgonia", "Later Stids", "Round Stids",
"Later Stids", "Late Scytes", "Sgonia")
# map the cell type lable of every cell to the cluster number
# of every cell
label_vec <- type_vec[colData(sce)$clust.20]
# add the cell type lables to coData
colData(sce)$fig2.labels <- label_vec
# plot
tsne.lab <- plotTSNE(sce, colour_by="fig2.labels")
ggsave(file = paste0(results_dir, "/TSNE_fig2.labels.pdf"),
width=8, height=4)
tsne.lab | tsne.20
Use aggregateAcrossCells()
from the scuttle package to
get the mean expression of genes within groups of cells. We’ll find the
mean by “fig2.labels” as an example.
mean_sce <- summarizeAssayByGroup(sce,
ids = sce$fig2.labels,
statistics = "mean",
assay.type = "logcounts")
# pull the mean assay into a data frame
mean_df <- assay(mean_sce, "mean")
head(mean_df)
## Late Scytes Later Stids Round Stids Scytes Sgonia
## Gm29155 0.00000000 0.01798324 0.02397328 0.0007110242 0.00000000
## Pcmtd1 0.01489881 0.00000000 0.00000000 0.0024451762 0.00000000
## Gm26901 0.09557378 0.04411660 0.02758087 0.0865191129 0.08797026
## Cdh7 0.00000000 0.00000000 0.00000000 0.0000000000 0.00000000
## Exo1 0.10075434 0.00000000 0.03333784 0.1070970622 0.04743240
## Becn2 0.00000000 0.00000000 0.00000000 0.0000000000 0.00000000
# write it to a csv file
write.csv(mean_df, file = paste0(results_dir, "/mean_logcounts_by_fig2.labels.csv"))
If authors of a study have deposited the count matrix they used in GEO, it’s often more efficient to use that than to re-process the raw sequencing files.
Using deposited data will also allow you to more accurately reproduce the original study. Often, only the cells used in the study after filtering for empty drops, outliers, and low quality cells are included in the deposited data, so there’s no need to filter yourself.
If the authors also include a meta data file that lists a cell-type label for each cell, you will not have to repeat the clustering and annotation steps.
Files deposited by the authors
/data/Lukassen_testes/GSE104556_matrix.mtx.gz
: From GEO
(GSE104556).
The gene x barcode UMI count matrix./data/Lukassen_testes/GSE104556_genes.tsv.gz
: From GEO
(GSE104556).
The genes that correspond to the rows of the UMI count matrix./data/Lukassen_testes/GSE104556_barcodes.tsv.gz
: From
GEO (GSE104556).
The barcodes that correspond to the columns of the UMI count
matrix./data/Lukassen_testes/SuppTableCells.xlsx
: From figshare.
Cell meta data including the cell-type labels./data/Lukassen_testes/SuppTableGenes.xlsx
: From figshare.
Gene meta data including gene names.Load the count matrix.
# Read the count matrix data from a Matrix Market file (.mtx.gz) and store it in the 'counts' variable
counts <- readMM(paste0(data_dir, "/Lukassen_testes/GSE104556_matrix.mtx.gz"))
# Display the dimensions of the 'counts' matrix (number of genes x number of cells)
dim(counts)
## [1] 51868 2552
# Read the barcodes file (which identifies individual cells) from a tab-delimited file and store it in 'barcodes'
barcodes <- read.delim(paste0(data_dir, "/Lukassen_testes/GSE104556_barcodes.tsv.gz"), header = FALSE)
# Display the dimensions of the 'barcodes' data (number of cells)
dim(barcodes)
## [1] 2552 1
# Display the first few rows of the 'barcodes' data
head(barcodes)
## V1
## 1 AAACCTGAGCTTATCG-1
## 2 AAACCTGGTTGAGTTC-1
## 3 AAACCTGTCAACGAAA-1
## 4 AAACGGGCACAGGTTT-1
## 5 AAACGGGTCATTTGGG-1
## 6 AAACGGGTCCTCATTA-1
# Display the last few rows of the 'barcodes' data
tail(barcodes)
## V1
## 2547 TTTGGTTGTGCAACTT-2
## 2548 TTTGGTTTCTACTATC-2
## 2549 TTTGTCACATCGGTTA-2
## 2550 TTTGTCAGTCCAGTTA-2
## 2551 TTTGTCATCTGTCCGT-2
## 2552 TTTGTCATCTTTCCTC-2
# The barcodes have "-1" or "-2" appended to them, corresponding to cells derived from mouse 1 or mouse 2 in the study
# Assign the barcodes as the column names of the 'counts' matrix
colnames(counts) <- barcodes$V1
# Read the gene information from a tab-delimited file and store it in 'genes'
genes <- read.delim(paste0(data_dir, "/Lukassen_testes/GSE104556_genes.tsv.gz"), header = FALSE)
# Display the dimensions of the 'genes' data (number of genes)
dim(genes)
## [1] 51868 2
# Display the first few rows of the 'genes' data
head(genes)
## V1 V2
## 1 ENSMUSG00000102693 4933401J01Rik
## 2 ENSMUSG00000064842 Gm26206
## 3 ENSMUSG00000051951 Xkr4
## 4 ENSMUSG00000102851 Gm18956
## 5 ENSMUSG00000103377 Gm37180
## 6 ENSMUSG00000104017 Gm37363
# Assign the gene names as the row names of the 'counts' matrix
rownames(counts) <- genes$V1
# Display the first few rows of the first 5 columns of the 'counts' matrix
head(counts[, 1:5])
## 6 x 5 sparse Matrix of class "dgTMatrix"
## AAACCTGAGCTTATCG-1 AAACCTGGTTGAGTTC-1 AAACCTGTCAACGAAA-1
## ENSMUSG00000102693 . . .
## ENSMUSG00000064842 . . .
## ENSMUSG00000051951 . . .
## ENSMUSG00000102851 . . .
## ENSMUSG00000103377 . . .
## ENSMUSG00000104017 . . .
## AAACGGGCACAGGTTT-1 AAACGGGTCATTTGGG-1
## ENSMUSG00000102693 . .
## ENSMUSG00000064842 . .
## ENSMUSG00000051951 . .
## ENSMUSG00000102851 . .
## ENSMUSG00000103377 . .
## ENSMUSG00000104017 . .
# get the number of cells each gene is expressed in
rs <- rowSums(counts!=0)
# remove genes expressed 0 cells
filt_mat <- counts[rs>0,]
Load the cell and gene meta data.
# Load gene data from an Excel file into a data frame
sup_genes <- read_excel(paste0(data_dir, "/Lukassen_testes/SuppTableGenes.xlsx"))
# Quickly view the structure and content of the 'sup_genes' data frame
glimpse(sup_genes)
## Rows: 27,790
## Columns: 29
## $ EnsemblID <chr> "ENSMUSG00000079701", "ENSMUSG000000…
## $ GeneName <chr> "Ssxb3", "Ssxb1", "Ssxb2", "Gm12695"…
## $ `RS1 Average` <dbl> 0.6408870, 1.9710779, 1.8309237, 1.1…
## $ `RS1 Log2 Fold Change vs all` <dbl> 5.018199, 4.160671, 4.043102, 3.8635…
## $ `RS1 P-Value` <dbl> 6.699313e-48, 4.186727e-45, 6.198828…
## $ `SC2 Average` <dbl> 0.038339822, 0.101548718, 0.09533145…
## $ `SC2 Log2 Fold Change vs all` <dbl> -2.601868, -2.955995, -2.957196, -2.…
## $ `SC2 P-Value` <dbl> 8.069916e-11, 4.570008e-16, 4.901922…
## $ `CS Average` <dbl> 0.002912556, 0.029125559, 0.03495067…
## $ `CS Log2 Fold Change vs all` <dbl> -5.092232, -4.342928, -4.010580, -4.…
## $ `CS P-Value` <dbl> 4.391694e-08, 2.714251e-09, 1.488424…
## $ `RS2 Average` <dbl> 0.009164249, 0.241936166, 0.26942891…
## $ `RS2 Log2 Fold Change vs all` <dbl> -4.27290084, -1.42161408, -1.1571368…
## $ `RS2 P-Value` <dbl> 1.207434e-10, 3.110400e-03, 1.837556…
## $ `ES Average` <dbl> 0.003711412, 0.022268474, 0.00371141…
## $ `ES Log2 Fold Change vs all` <dbl> -4.705656, -4.611644, -6.333842, -3.…
## $ `ES P-Value` <dbl> 7.676492e-06, 1.277472e-07, 5.895288…
## $ `SC1 Average` <dbl> 0.005288957, 0.037022699, 0.02644478…
## $ `SC1 Log2 Fold Change vs all` <dbl> -4.156645, -3.869184, -4.196458, -4.…
## $ `SC1 P-Value` <dbl> 8.254880e-04, 2.035123e-04, 8.704576…
## $ `Sertoli Average` <dbl> 0.0000000, 0.4383190, 0.0000000, 0.0…
## $ `Sertoli Log2 Fold Change vs all` <dbl> 0.023696955, -0.105260671, -1.602710…
## $ `Sertoli P-Value` <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0…
## $ `Spg Average` <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0…
## $ `Spg Log2 Fold Change vs all` <dbl> 0.3243484, -1.3911729, -1.3020588, -…
## $ `Spg P-Value` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ `Leydig Average` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `Leydig Log2 Fold Change vs all` <dbl> 0.6014802, -1.1140411, -1.0249270, -…
## $ `Leydig P-Value` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
# Convert the 'sup_genes' tibble (from read_excel) into a standard data frame
sup_genes <- as.data.frame(sup_genes)
# Set the Ensembl ID column as the row names of the 'sup_genes' data frame
rownames(sup_genes) <- sup_genes$EnsemblID
# Remove the 'EnsemblID' column now that it is used as row names
sup_genes$EnsemblID <- NULL
# Load cell data from another Excel file into a data frame
sup_cells <- read_excel(paste0(data_dir, "/Lukassen_testes/SuppTableCells.xlsx"))
# Quickly view the structure and content of the 'sup_cells' data frame
glimpse(sup_cells)
## Rows: 2,552
## Columns: 13
## $ Barcode <chr> "AAACCTGAGCTTATCG-1", "AAACCTGGTTGAGTTC-1", "A…
## $ Replicate <chr> "mouse 1", "mouse 1", "mouse 1", "mouse 1", "m…
## $ `TSNE-1` <dbl> 1.138438, 6.849414, -23.578904, -26.702202, -4…
## $ `TSNE-2` <dbl> -32.931800, -34.013543, -13.438399, -11.977982…
## $ UMICount <dbl> 23114, 19002, 22849, 27921, 18466, 15074, 1651…
## $ GeneCount <dbl> 4977, 4694, 4908, 5392, 4718, 3714, 4425, 7810…
## $ ProportionMitochondrial <dbl> 3.461555e-04, 4.210748e-04, 4.377517e-05, 1.43…
## $ Cluster <dbl> 4, 1, 4, 4, 5, 5, 1, 2, 4, 2, 2, 5, 3, 2, 2, 2…
## $ CellType <chr> "RS2", "RS1", "RS2", "RS2", "ES", "ES", "RS1",…
## $ MonoclePseudotime <dbl> 28.0360268, 26.6232265, 33.4668827, 34.5019604…
## $ MonocleRank <dbl> 1502, 1409, 1777, 1805, 1881, 1931, 1045, 257,…
## $ ScratPseudotime <dbl> 0.019297217, 0.021289324, 0.014213382, 0.01407…
## $ ScratRank <dbl> 1474, 1393, 1767, 1782, 1899, 1951, 1048, 511,…
# Convert the 'sup_cells' tibble (from read_excel) into a standard data frame
sup_cells <- as.data.frame(sup_cells)
# Set the Barcode column as the row names of the 'sup_cells' data frame
rownames(sup_cells) <- sup_cells$Barcode
# Remove the 'Barcode' column now that it is used as row names
sup_cells$Barcode <- NULL
Create a SingleCellExperiment
object using the filtered
matrix, with the counts stored in the ‘counts’ assay. Add
sup_cells
to the colData slot. Add gene names to the
rowData slot.
sce <- SingleCellExperiment(assays = list(counts = filt_mat), colData = sup_cells)
Add the gene meta data to the sce
object.
# Reorder the rows of 'sup_genes' to match the rownames of the 'sce' object
# This ensures that the gene names and associated data in 'sup_genes' are aligned with the data in 'sce'
sup_genes <- sup_genes[match(rownames(sce), rownames(sup_genes)),]
# Add a new column to the 'rowData' of the 'sce' object to store the gene names from 'sup_genes'
rowData(sce)$GeneNames <- sup_genes$GeneName
If you want, you can switch the rownames(sce)
from
ensembl IDs to gene names so that other functions will operate on the
gene names instead of ensembl IDs.
# Add Ensembl IDs to 'rowData' to preserve them, storing them under a new column 'GeneIDs'
rowData(sce)$GeneIDs <- rownames(sce)
# Replace the row names of 'sce' with the gene names stored in the 'GeneNames' column of 'rowData'
# This makes gene names (rather than Ensembl IDs) the primary row identifiers in 'sce'
rownames(sce) <- rowData(sce)$GeneNames
# Display the 'sce' object to confirm the changes
sce
## class: SingleCellExperiment
## dim: 27790 2552
## metadata(0):
## assays(1): counts
## rownames(27790): 4933401J01Rik Gm38385 ... ZP3AR_Satellite_Muridae
## ZP3AR_MM_Satellite_Mus
## rowData names(2): GeneNames GeneIDs
## colnames(2552): AAACCTGAGCTTATCG-1 AAACCTGGTTGAGTTC-1 ...
## TTTGTCATCTGTCCGT-2 TTTGTCATCTTTCCTC-2
## colData names(12): Replicate TSNE-1 ... ScratPseudotime ScratRank
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
At this point, you can perform the following analyses from above:
#normalize
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)
#feature selection
rowSubset(sce, "HVGs") <- getTopHVGs(sce, n=1000)
# PCA
sce <- runPCA(sce, subset_row=rowSubset(sce, "HVGs"))
percent.var <- attr(reducedDim(sce), "percentVar")
chosen.elbow <- findElbowPoint(percent.var)
reducedDim(sce, "PCA.elbow") <- reducedDim(sce, "PCA")[,1:chosen.elbow]
# runTSNE
sce <- runTSNE(sce, dimred="PCA.elbow")
Make a TSNE plot with the cell types from the paper.
tsne.CellType <- plotTSNE(sce,
colour_by="CellType")
tsne.CellType
# Save the tsne plot as a PNG file in the specified results directory
ggsave(tsne.CellType, file = paste0(results_dir, "/TSNE_Lukassen_CellTypes.png"), width = 4, height=4)
In this case, the colData already includes the first two TSNE coordinates identified by the authors, so there isn’t really a need to do anything other than Normalization and Transformation in order to repeat their analyses. But including this information is uncommon.
Plot the author’s TSNE coordinates.
# Extract the t-SNE coordinates (TSNE-1 and TSNE-2 columns) from the 'sup_cells' data frame
paper_tsne <- sup_cells[, c("TSNE-1", "TSNE-2")]
# Assign the extracted t-SNE coordinates to the 'sce' SingleCellExperiment object
# This adds 'paper_tsne' as a reduced dimension slot in 'sce' named "paper_tsne"
reducedDim(sce, "paper_tsne") <- paper_tsne
paper_tsne.CellType <- plotReducedDim(sce,
colour_by ="CellType",
dimred = "paper_tsne")
paper_tsne.CellType
# Save the tsne plot as a PNG file in the specified results directory
ggsave(paper_tsne.CellType, file = paste0(results_dir, "/Lukassen_TSNE_Lukassen_CellTypes.png"), width = 4, height=4)
Get the average expression of each gene in each cluster.
mean_sce <- summarizeAssayByGroup(sce,
ids = sce$CellType,
statistics = "mean",
assay.type = "logcounts")
# pull the mean assay into a data frame
mean_df <- assay(mean_sce, "mean")
head(mean_df)
## CS ES Leydig RS1 RS2 SC1
## 4933401J01Rik 0.000000000 0.01570434 0.0000000 0.01466208 0.01895485 0.00000000
## Gm38385 0.000000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
## Rp1 0.026417045 0.09468464 0.0000000 0.12953817 0.10351288 0.34370331
## Sox17 0.023717340 0.04022804 0.0000000 0.29172019 0.07635653 0.01826828
## Gm37587 0.018696526 0.03016153 0.4500544 0.12399087 0.09324875 0.01535353
## Gm37323 0.007725922 0.07567644 0.0000000 0.00183389 0.01159152 0.00000000
## SC2 Sertoli Spg
## 4933401J01Rik 0.000000000 0 0.0000000
## Gm38385 0.003460472 0 0.0000000
## Rp1 0.322834514 0 0.0000000
## Sox17 0.415854932 0 0.0000000
## Gm37587 0.134952815 0 0.1553769
## Gm37323 0.002597593 0 0.0000000
# write it to a csv file
write.csv(mean_df, file=paste0(results_dir, "/mean_logcounts_by_Lukassen_CellTypes.csv"))
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS/LAPACK: FlexiBLAS OPENBLAS; LAPACK version 3.11.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: US/Eastern
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] readxl_1.4.3 RColorBrewer_1.1-3
## [3] pheatmap_1.0.12 AUCell_1.24.0
## [5] scDblFinder_1.16.0 bluster_1.12.0
## [7] PCAtools_2.14.0 ggrepel_0.9.4
## [9] patchwork_1.1.3 Matrix_1.6-4
## [11] DropletUtils_1.22.0 scran_1.30.2
## [13] scater_1.30.1 scuttle_1.12.0
## [15] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
## [17] Biobase_2.62.0 GenomicRanges_1.54.1
## [19] GenomeInfoDb_1.38.8 IRanges_2.36.0
## [21] S4Vectors_0.40.2 BiocGenerics_0.48.1
## [23] MatrixGenerics_1.14.0 matrixStats_1.1.0
## [25] lubridate_1.9.3 forcats_1.0.0
## [27] stringr_1.5.0 dplyr_1.1.4
## [29] purrr_1.0.2 readr_2.1.4
## [31] tidyr_1.3.0 tibble_3.2.1
## [33] ggplot2_3.4.4 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.2 later_1.3.1
## [3] BiocIO_1.12.0 bitops_1.0-7
## [5] R.oo_1.25.0 cellranger_1.1.0
## [7] graph_1.80.0 XML_3.99-0.16
## [9] lifecycle_1.0.3 edgeR_4.0.16
## [11] lattice_0.22-5 MASS_7.3-60
## [13] magrittr_2.0.3 plotly_4.10.3
## [15] limma_3.58.1 sass_0.4.7
## [17] rmarkdown_2.25 jquerylib_0.1.4
## [19] yaml_2.3.7 metapod_1.10.1
## [21] httpuv_1.6.12 cowplot_1.1.1
## [23] DBI_1.1.3 abind_1.4-5
## [25] zlibbioc_1.48.2 Rtsne_0.17
## [27] mixtools_2.0.0 R.utils_2.12.3
## [29] RCurl_1.98-1.13 GenomeInfoDbData_1.2.11
## [31] irlba_2.3.5.1 annotate_1.80.0
## [33] dqrng_0.4.0 DelayedMatrixStats_1.24.0
## [35] codetools_0.2-19 DelayedArray_0.28.0
## [37] tidyselect_1.2.0 farver_2.1.1
## [39] ScaledMatrix_1.10.0 viridis_0.6.4
## [41] GenomicAlignments_1.38.2 jsonlite_1.8.7
## [43] BiocNeighbors_1.20.2 ellipsis_0.3.2
## [45] survival_3.5-7 systemfonts_1.0.5
## [47] segmented_2.0-0 tools_4.3.2
## [49] ragg_1.2.6 Rcpp_1.0.11
## [51] glue_1.6.2 gridExtra_2.3
## [53] SparseArray_1.2.4 xfun_0.41
## [55] HDF5Array_1.30.1 withr_2.5.2
## [57] fastmap_1.1.1 rhdf5filters_1.14.1
## [59] fansi_1.0.5 digest_0.6.33
## [61] rsvd_1.0.5 timechange_0.2.0
## [63] R6_2.5.1 mime_0.12
## [65] textshaping_0.3.7 colorspace_2.1-0
## [67] RSQLite_2.3.4 R.methodsS3_1.8.2
## [69] utf8_1.2.4 generics_0.1.3
## [71] data.table_1.14.10 FNN_1.1.3.2
## [73] rtracklayer_1.62.0 htmlwidgets_1.6.2
## [75] httr_1.4.7 S4Arrays_1.2.1
## [77] uwot_0.2.2 pkgconfig_2.0.3
## [79] gtable_0.3.4 blob_1.2.4
## [81] XVector_0.42.0 htmltools_0.5.7
## [83] GSEABase_1.64.0 scales_1.3.0
## [85] png_0.1-8 knitr_1.45
## [87] rstudioapi_0.15.0 tzdb_0.4.0
## [89] reshape2_1.4.4 rjson_0.2.21
## [91] nlme_3.1-164 cachem_1.0.8
## [93] rhdf5_2.46.1 parallel_4.3.2
## [95] vipor_0.4.5 AnnotationDbi_1.64.1
## [97] restfulr_0.0.15 pillar_1.9.0
## [99] grid_4.3.2 vctrs_0.6.4
## [101] promises_1.2.1 BiocSingular_1.18.0
## [103] beachmat_2.18.1 xtable_1.8-4
## [105] cluster_2.1.6 beeswarm_0.4.0
## [107] evaluate_0.23 cli_3.6.1
## [109] locfit_1.5-9.8 compiler_4.3.2
## [111] Rsamtools_2.18.0 rlang_1.1.2
## [113] crayon_1.5.2 labeling_0.4.3
## [115] plyr_1.8.9 ggbeeswarm_0.7.2
## [117] stringi_1.7.12 viridisLite_0.4.2
## [119] BiocParallel_1.36.0 munsell_0.5.0
## [121] Biostrings_2.70.3 lazyeval_0.2.2
## [123] hms_1.1.3 sparseMatrixStats_1.14.0
## [125] bit64_4.0.5 Rhdf5lib_1.24.2
## [127] KEGGREST_1.42.0 statmod_1.5.0
## [129] shiny_1.7.5.1 highr_0.10
## [131] kernlab_0.9-32 igraph_1.5.1
## [133] memoise_2.0.1 bslib_0.5.1
## [135] bit_4.0.5 xgboost_1.7.6.1