scRNA-seq analysis in R

OSCA: Orchestrating Single-Cell Analysis with Bioconductor

This tutorial is based off of the “Orchestrating Single-Cell Analysis with Bioconductor” e-book, a comprehensive resource designed to guide users through the process of analyzing single-cell RNA sequencing (scRNA-seq) data using the Bioconductor ecosystem in R. Links to more detailed explanations of the material are included in the tutorial.

Outline

In this tutorial you will learn to:

Make a SingleCellExperiment object from count data derived from Kaslisto-Bustools or procressed counts downloaded from GEO.
Detect empty droplets with DropletUtils.
Detect barecodes that correspond to ‘doublets’.
Identify and remove low quality cells from your data.
Normalize and transform the raw counts for downstream analysis
Select highly variable genes in the data.
Perform dimesnionality reductions using PCA, tSNE and UMAP.
Cluster the cells based on their gene expression profiles.
Identify cluster enriched genes.
Assign cell labels from gene sets.
Calculate mean gene expression per clusters or cell type.
Visualize your results.

Set up

Open the R environment on HPCC

module purge
module load R-bundle-CRAN/2023.12-foss-2023a
R --vanilla

Load libraries

library(tidyverse)
library(SingleCellExperiment)
library(scater)
library(scran)
library(DropletUtils)
library(Matrix)
library(patchwork)
library(PCAtools)
library(bluster)
library(scDblFinder)
library(AUCell)
library(pheatmap)
library(RColorBrewer)
library(readxl)

Set paths

data_dir <- "/mnt/research/bioinformaticsCore/projects/petroffm/BCC105_scRNAseq_training/data"

results_dir <- "/mnt/research/bioinformaticsCore/projects/petroffm/BCC105_scRNAseq_training/results/Lukassen_testes"

# if results_dir doesn't exist, make it
if(!dir.exists(results_dir)){dir.create(results_dir)}

Read in the count data

# Read in the count matrix that was output by `kb`.
mat <- readMM(paste0(data_dir, "/Lukassen_testes/SRR6129050_mouse1/counts_unfiltered/cells_x_genes.mtx"))

dim(mat)

## [1] 681407  34183

head(mat[,1:5])

## 6 x 5 sparse Matrix of class "dgTMatrix"
##               
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . . . .
## [5,] . . . . .
## [6,] . . . . .

# read in the gene names
gene_names = read.delim(paste0(data_dir,"/Lukassen_testes/SRR6129050_mouse1/counts_unfiltered/cells_x_genes.genes.names.txt"), header=F)

dim(gene_names)

## [1] 34183     1

head(gene_names)

##              V1
## 1       Gm15178
## 2       Gm29155
## 3 C730036E19Rik
## 4       Gm29157
## 5       Gm29156
## 6        Pcmtd1

# add genes as column names 
colnames(mat) <- gene_names$V1
head(mat[,1:5])

## 6 x 5 sparse Matrix of class "dgTMatrix"
##      Gm15178 Gm29155 C730036E19Rik Gm29157 Gm29156
## [1,]       .       .             .       .       .
## [2,]       .       .             .       .       .
## [3,]       .       .             .       .       .
## [4,]       .       .             .       .       .
## [5,]       .       .             .       .       .
## [6,]       .       .             .       .       .

# read in the cell barcodes
barcodes <- read.delim(paste0(data_dir,"/Lukassen_testes/SRR6129050_mouse1/counts_unfiltered/cells_x_genes.barcodes.txt"), header=F)

dim(barcodes)

## [1] 681407      1

head(barcodes)

##                 V1
## 1 AAACCTGAGAAACCAT
## 2 AAACCTGAGAAACCGC
## 3 AAACCTGAGAAACCTA
## 4 AAACCTGAGAAACGAG
## 5 AAACCTGAGAAACGCC
## 6 AAACCTGAGAAAGTGG

# add cell barcodes as row names
rownames(mat) <- barcodes$V1
head(mat[,1:5])

## 6 x 5 sparse Matrix of class "dgTMatrix"
##                  Gm15178 Gm29155 C730036E19Rik Gm29157 Gm29156
## AAACCTGAGAAACCAT       .       .             .       .       .
## AAACCTGAGAAACCGC       .       .             .       .       .
## AAACCTGAGAAACCTA       .       .             .       .       .
## AAACCTGAGAAACGAG       .       .             .       .       .
## AAACCTGAGAAACGCC       .       .             .       .       .
## AAACCTGAGAAAGTGG       .       .             .       .       .

# get the total counts per barcode
rs <- rowSums(mat)

# remove barcodes with no detected counts from the matrix
filt_mat <- mat[rs>0,]

# get total number of cells each gene is expressed in 
cells_exp = rowSums(t(filt_mat)!=0)

filt_mat <- filt_mat[,cells_exp>0]

Make a `SingleCellExperiment` object

A SingleCellExperiment object is a data structure in R, specifically designed for storing and managing single-cell RNA sequencing (scRNA-seq) data.

Key Features:

Data Storage: The SingleCellExperiment object holds the main data matrix, typically containing gene expression counts, where rows represent genes and columns represent cells.

Metadata: It includes slots for storing metadata related to both the rows (genes/features) and columns (cells/samples). This metadata can include information such as gene annotations or cell type labels.

Assays: The object supports multiple “assays,” allowing you to store different versions of the data (e.g., raw counts, normalized counts, or log-transformed data) within the same object.

Reduced Dimensions: It can store the results of dimensionality reduction techniques like PCA, t-SNE, or UMAP, making it easier to manage and visualize lower-dimensional representations of the data.

Cell and Feature-Level Data: The colData and rowData slots store additional cell-level and gene-level information, respectively, such as cell quality metrics, clustering results, or gene annotations.

Interoperability: SingleCellExperiment objects are designed to be compatible with a wide range of Bioconductor packages, facilitating various downstream analyses, including clustering, differential expression analysis, and trajectory inference.

Right now, we have a matrix where barcodes are rows and genes are columns. SingleCellExperiment() expects genes to be rows and barcodes to be columns, so we transpose the matrix with the t() function.

sce <- SingleCellExperiment(assays = list(counts = t(filt_mat)))

To access the count data we just supplied, we can do any one of the following:

assay(sce, “counts”) - this is the most general method, where we can supply the name of the assay as the second argument.
counts(sce) - this is a short-cut for the above, but only works for assays with the special name “counts”.

Detect empty droplets with `DropletUtils`

OSCA advanced: Chapter 7

For droplet-based data only (ie 10x and drop-seq)

Knee plot

A useful diagnostic for droplet-based data is the barcode rank plot, which shows the (log-)total UMI count for each barcode on the y-axis and the (log-)rank on the x-axis. This is effectively a transposed empirical cumulative density plot with log-transformed axes. It is useful as it allows users to examine the distribution of total counts across barcodes, focusing on those with the largest counts.

br.out <- barcodeRanks(counts(sce))
head(br.out)

## DataFrame with 6 rows and 3 columns
##                       rank     total    fitted
##                  <numeric> <integer> <numeric>
## AAACCTGAGAAACCAT    453160         3        NA
## AAACCTGAGAAACCGC    537144         2        NA
## AAACCTGAGAAACCTA    279308         6        NA
## AAACCTGAGAAACGAG    629267         1        NA
## AAACCTGAGAAACGCC    453160         3        NA
## AAACCTGAGAAAGTGG    279308         6        NA

# make the rownames a column
br.out$barcodes <- rownames(br.out)
head(br.out)

## DataFrame with 6 rows and 4 columns
##                       rank     total    fitted         barcodes
##                  <numeric> <integer> <numeric>      <character>
## AAACCTGAGAAACCAT    453160         3        NA AAACCTGAGAAACCAT
## AAACCTGAGAAACCGC    537144         2        NA AAACCTGAGAAACCGC
## AAACCTGAGAAACCTA    279308         6        NA AAACCTGAGAAACCTA
## AAACCTGAGAAACGAG    629267         1        NA AAACCTGAGAAACGAG
## AAACCTGAGAAACGCC    453160         3        NA AAACCTGAGAAACGCC
## AAACCTGAGAAAGTGG    279308         6        NA AAACCTGAGAAAGTGG

# Making the knee plot.
ggplot(as.data.frame(br.out), aes(x=rank, y=total)) +
  geom_point() +
  geom_hline(aes(yintercept=metadata(br.out)$knee,
             linetype="knee"),
             color = "dodgerblue") +
  geom_hline(aes(yintercept=metadata(br.out)$inflection,
             linetype="inflection"),
             color = "forestgreen") +
  scale_linetype_manual(name = "legend", 
                        values = c(2, 2), 
                        guide = guide_legend(override.aes = list(color = c("forestgreen", "dodgerblue")))) +
  scale_x_continuous(trans='log10') +
  scale_y_continuous(trans='log10') +
  xlab("Rank") +
  ylab("Total Counts")

ggsave(file = paste0(results_dir, "/kneeplot.png"))

Distinguish empty droplets from cells

Empty droplets often contain RNA from the ambient solution, resulting in non-zero counts after debarcoding. The emptyDrops function is designed to distinguish between empty droplets and cells. It does so by testing each barcode’s expression profile for significant deviation from the ambient profile.

# Set a seed. This can be any number. Using the same number every time ensures that a sequence of random numbers generated is the same every time you run your code.
set.seed(100) 

# set `by.rank` to the max number of cells you expect
e.out <- emptyDrops(counts(sce), 
                    by.rank = 2000) 
e.out

## DataFrame with 676046 rows and 5 columns
##                      Total   LogProb    PValue   Limited       FDR
##                  <integer> <numeric> <numeric> <logical> <numeric>
## AAACCTGAGAAACCAT         3        NA        NA        NA        NA
## AAACCTGAGAAACCGC         2        NA        NA        NA        NA
## AAACCTGAGAAACCTA         6        NA        NA        NA        NA
## AAACCTGAGAAACGAG         1        NA        NA        NA        NA
## AAACCTGAGAAACGCC         3        NA        NA        NA        NA
## ...                    ...       ...       ...       ...       ...
## TTTGTCATCTTTACAC         1        NA        NA        NA        NA
## TTTGTCATCTTTACGT         8        NA        NA        NA        NA
## TTTGTCATCTTTAGGG       734        NA        NA        NA        NA
## TTTGTCATCTTTAGTC         7        NA        NA        NA        NA
## TTTGTCATCTTTCCTC        21        NA        NA        NA        NA

# add barcodes to a column
e.out$barcode = rownames(e.out)

# make a vector of significant barcodes
# use a false discovery rate (FDR) of 0.1%,
# meaning that no more than 0.1% of our called 
# barcodes should be empty droplets on average.
sig_barcodes <- 
  as_tibble(e.out) %>%
  filter(FDR < .001) %>% 
  pull(barcode)

# how many cells?
length(sig_barcodes)

## [1] 1537

This is close to the number of cells used in the paper. Filter the sce for these cells.

sce <- sce[,sig_barcodes]

Identifying low quality cells

OSCA basic: Chapter 1

Common metrics include for identifying low quality cells include:

Total counts
Number of expressed features (genes)
Proportion of reads mapped to genes in the mitochondrial genome

For each cell, we calculate these QC metrics using the perCellQCMetrics() function from the scater package.

# Identifying the mitochondrial transcripts in our SingleCellExperiment.

# mouse mito genes start with "mt"
# human is "MT"
# change for different species
# this is a numeric vector of row numbers
# corresponding to mito genes
is.mito <- grep("^mt", rownames(sce))

# show the gene names
rownames(sce)[is.mito]

##  [1] "mt-Nd1"  "mt-Nd2"  "mt-Co1"  "mt-Co2"  "mt-Atp8" "mt-Atp6" "mt-Co3" 
##  [8] "mt-Nd3"  "mt-Nd4l" "mt-Nd4"  "mt-Nd5"  "mt-Nd6"  "mt-Cytb"

# get the per cell stats
stats <- perCellQCMetrics(sce,
                          subsets=list(Mito=is.mito))

head(stats)

## DataFrame with 6 rows and 6 columns
##                        sum  detected subsets_Mito_sum subsets_Mito_detected
##                  <numeric> <numeric>        <numeric>             <numeric>
## AAACCTGAGCTTATCG     24918      5292               18                     5
## AAACCTGGTTGAGTTC     20332      4995               18                     6
## AAACCTGTCAACGAAA     24765      5240                5                     3
## AAACCTGTCTCATTCA      7808      3083              557                    13
## AAACGGGCACAGGTTT     30331      5766                9                     6
## AAACGGGCACTCAGGC      8331      3295              746                    11
##                  subsets_Mito_percent     total
##                             <numeric> <numeric>
## AAACCTGAGCTTATCG            0.0722369     24918
## AAACCTGGTTGAGTTC            0.0885304     20332
## AAACCTGTCAACGAAA            0.0201898     24765
## AAACCTGTCTCATTCA            7.1337090      7808
## AAACGGGCACAGGTTT            0.0296726     30331
## AAACGGGCACTCAGGC            8.9545073      8331

‘sum’ is the the sum of counts for each cell. ‘detected’ is the number of genes with at least one count in each cell. ‘subsets_Mito_percent’ is the percent of the total counts (‘sum’) that come from mitochrondria genes: ‘subsets_Mito_sum’.

Summarize the stats

summary(stats$sum)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3760    6236   16705   19566   23319  134159

summary(stats$detected)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1144    2708    4324    4126    5098    8944

summary(stats$subsets_Mito_percent)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1065  0.6323  2.0879  1.4527 34.8936

Filtering with fixed thresholds

If paper that describes the data outlines the thresholds used to quality filter the cells, you can use those fixed thresholds to filter. For example:

# these aren't actually from the paper
# I made them up as an example
qc.counts <- stats$sum < 5000
qc.nexprs <- stats$detected < 2000
qc.mito <- stats$subsets_Mito_percent > 10
discard <- qc.counts | qc.nexprs | qc.mito

# Summarize the number of cells removed for each reason.
DataFrame(LibSize=sum(qc.counts), NExprs=sum(qc.nexprs), MitoProp=sum(qc.mito), Total=sum(discard))

## DataFrame with 1 row and 4 columns
##     LibSize    NExprs  MitoProp     Total
##   <integer> <integer> <integer> <integer>
## 1       267        55        85       317

# remove cells that don't meet the thresholds
fixed_sce = sce[,!discard]
fixed_sce

## class: SingleCellExperiment 
## dim: 27058 1220 
## metadata(0):
## assays(1): counts
## rownames(27058): Gm29155 Pcmtd1 ... ENSMUSG00000095041.8
##   ENSMUSG00000095742.2
## rowData names(0):
## colnames(1220): AAACCTGAGCTTATCG AAACCTGGTTGAGTTC ... TTTGTCAGTAATAGCA
##   TTTGTCAGTCTCGTTC
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Filtering with adaptive thresholds

If the hard thresholds used by the authors are not available, you can use a relaxed QC strategy and only remove cells with large mitochondrial proportions, using it as a proxy for cell damage. This reduces the risk of removing cell types with low RNA content, especially in a heterogeneous populations with many different cell types. More information about the isOutlier() function can be found here

high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")

# how many high mito cells?
sum(high.mito)

## [1] 292

# What are the cuttoffs?
attr(high.mito, "thresholds")

##    lower   higher 
##     -Inf 3.025697

# add the stats information to the sce object as col data
colData(sce) <- cbind(colData(sce), stats)

# add a column indicating the cells you want to discard
# in this case high.mito cells
sce$discard <- high.mito

# plot the distribution of the stats and indicate which cells will be discarded

plot.count = 
  plotColData(sce, 
              y="sum", 
              colour_by="discard") +
  scale_y_log10() + 
  ggtitle("Total count")

plot.genes = 
  plotColData(sce, 
              y="detected", 
              colour_by="discard") +
  scale_y_log10() + 
  ggtitle("Detected genes")

plot.mito = 
  plotColData(sce, 
              y="subsets_Mito_percent", 
              colour_by="discard") +
  ggtitle("Mito percent")

plot.all = plot.count | plot.genes | plot.mito

plot.all

ggsave(plot.all, 
       height = 4,
       file = paste0(results_dir, "/qc_plots.png"))

The high mito cells tend to have lower total UMI counts and lower number of detected genes.

Remove the high.mito cells from sce.

sce <- sce[,!high.mito]
sce

## class: SingleCellExperiment 
## dim: 27058 1245 
## metadata(0):
## assays(1): counts
## rownames(27058): Gm29155 Pcmtd1 ... ENSMUSG00000095041.8
##   ENSMUSG00000095742.2
## rowData names(0):
## colnames(1245): AAACCTGAGCTTATCG AAACCTGGTTGAGTTC ... TTTGTCAGTAATAGCA
##   TTTGTCAGTCTCGTTC
## colData names(7): sum detected ... total discard
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Normalization and transformation

OSCA basic: Chapter 2

To accurately account for differences in library size (total count) between cells, we will first perform a quick cluster analysis to group similar cells together and calculate cell-specific scaling factors (often called size factors) using computeSumFactors(). See the detailed explanation from OSCA here

Once we have computed the size factors, we use the logNormCounts() function to compute normalized expression values for each cell. This is done by dividing the count for each gene with the appropriate size factor for that cell. The function also log-transforms the normalized values, creating a new assay called “logcounts”. See the detailed explanation from OSCA here

clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)

Feature selection

OSCA basic: Chapter 3

For cell clustering we need to select genes that contain useful information about the biology of the system while removing genes that contain random noise. The simplest approach to feature selection is to select the most variable genes based on their expression across cells. We’ll use getTopHVGs() to detect these genes.

The number of highly variable genes to consider is fairly arbitrary, with any value from 500 to 5000 considered “reasonable”. n=1000 is a common choice.

# store the highly variable genes within the single cell object
rowSubset(sce, "HVGs") <- getTopHVGs(sce, n=1000)

Dimesnionality reduction: PCA for downstream analysis

OSCA basic: Chapter 4

In single-cell RNA-seq analysis, we have a lot of genes being measured across thousands of cells. This creates a massive amount of data, with each cell having a unique expression pattern for thousands of genes. Handling and interpreting such large datasets can be challenging.

Principal Component Analysis (PCA) helps by reducing the complexity of the data. Instead of looking at thousands of genes, PCA combines them into a smaller number of “principal components.” Each principal component is a combination of gene expression patterns that captures the most variation in the data. We’ll use the PCs to define and visualize our cell clusters later.

We’ll run PCA on the expression of the highly variable genes.

sce <- runPCA(sce, subset_row=rowSubset(sce, "HVGs"))
sce

## class: SingleCellExperiment 
## dim: 27058 1245 
## metadata(0):
## assays(2): counts logcounts
## rownames(27058): Gm29155 Pcmtd1 ... ENSMUSG00000095041.8
##   ENSMUSG00000095742.2
## rowData names(1): HVGs
## colnames(1245): AAACCTGAGCTTATCG AAACCTGGTTGAGTTC ... TTTGTCAGTAATAGCA
##   TTTGTCAGTCTCGTTC
## colData names(8): sum detected ... discard sizeFactor
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):

runPCA() reports the first 50 PCs by default, but how many do you actually need for downstream analysis? A simple heuristic for choosing the suitable number of PCs involves identifying the elbow point in the percentage of variance explained by successive PCs. This refers to the “elbow” in the curve of a scree plot. See OSCA advanced: Chapter 4 for more details.

We can plot a scree plot and identify the “elbow” using findElbowPoint().

# Percentage of variance explained is tucked away in the attributes.
percent.var <- attr(reducedDim(sce), "percentVar")

chosen.elbow <- findElbowPoint(percent.var)
chosen.elbow

## [1] 4

# make the scree plot
plot(percent.var, xlab="PC", ylab="Variance explained (%)")
abline(v=chosen.elbow, col="red")

# save as a pdf
pdf(file = paste0(results_dir, "/scree.pdf"))
plot(percent.var, xlab="PC", ylab="Variance explained (%)")
abline(v = chosen.elbow, col="red")
dev.off()

## png 
##   2

# store only the relevant PCs in the sce 
# with the name "PCA.elbow"
reducedDim(sce, "PCA.elbow") <- reducedDim(sce, "PCA")[,1:chosen.elbow]

Clustering

OSCA basic: Chapter 5

How do you find the “best” clusters within the single-cell data? I like this answer from the overview of the OSCA clustering chapter:

A more relevant question is “how well do the clusters approximate the cell types or states of interest?” Unfortunately, this is difficult to answer given the context-dependent interpretation of the underlying biology. Some analysts will be satisfied with resolution of the major cell types; other analysts may want resolution of subtypes; and others still may require resolution of different states (e.g., metabolic activity, stress) within those subtypes. Moreover, two clusterings can be highly inconsistent yet both valid, simply partitioning the cells based on different aspects of biology. Indeed, asking for an unqualified “best” clustering is akin to asking for the best magnification on a microscope without any context.

So, there’s no “one-size fits all” solution for finding the best clusters.

Adjust \(k\) to change the cluster resolution

One basic way to alter clustering is by changing the value \(k\), or the number of nearest neighbors of each cell to consider when building clusters.This controls the resolution of the clustering where higher \(k\) yields broader clusters. You can exploit this by experimenting with different values of \(k\) to obtain a satisfactory resolution. clusterCells() uses k=10 as a default and is a good place to start.

As mentioned above, we’ll cluster the cells in PC space rather than gene expression space for efficiency. We specify that the dimensionality reduction (“dimred”) we want to use is the PCs we defined using the elbow plot.

clust.10 <- clusterCells(sce, use.dimred="PCA.elbow")
table(clust.10)

## clust.10
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
##  82 101  49 117 102  95  82  51  89  56 109  96  35  82  41  58

k=10 gives us 16 clusters. Let’s try k=40 and k=20 for comparison.

clust.40 <- clusterCells(sce, 
                        use.dimred="PCA.elbow", 
                        BLUSPARAM=NNGraphParam(k=40))

table(clust.40)

## clust.40
##   1   2   3   4   5   6   7 
## 200 242 323 120 147 112 101

clust.20 <- clusterCells(sce, 
                        use.dimred="PCA.elbow", 
                        BLUSPARAM=NNGraphParam(k=20))
table(clust.20)

## clust.20
##   1   2   3   4   5   6   7   8   9  10  11  12  13 
##  88 181 112 116  88 106 151  92  72  53  65  57  64

Add the cluster information to the colData field of the sce object.

sce$clust.10 <- clust.10
sce$clust.40 <- clust.40
sce$clust.20 <- clust.20

It can also be useful to plot our QC metrics by cluster to see if cells cluster based on technical rather than biological factors. We can do this with plotColData(). I’ll use clust.20 as an example.

# plot
plot.count <- 
  plotColData(sce, 
              y="sum", 
              x = "clust.20",
              colour_by="clust.20") +
  scale_y_log10() + 
  ggtitle("Total count")

plot.genes <- 
  plotColData(sce, 
              y="detected", 
              x = "clust.20",
              colour_by="clust.20") +
  scale_y_log10() + 
  ggtitle("Detected genes")

plot.mito <- 
  plotColData(sce, 
              y="subsets_Mito_percent", 
              x = "clust.20",
              colour_by="clust.20") +
  ggtitle("Mito percent")

plot.all <- plot.count / plot.genes / plot.mito

plot.all

ggsave(plot.all, 
       file = paste0(results_dir, "/qc_plots_by_clust20.png"))

Dimesnionality reduction: UMAP and TSNE for visualization

OSCA basic: Chapter 4.5.2

Single-cell RNA-seq data is high-dimensional, with thousands of genes measured across thousands to millions of cells. Uniform manifold approximation and projection (UMAP) and \(t\)-stochastic neighbor embedding (t-SNE) effectively reduce this complexity to 2D space, making it easier to visualize while preserving some of the underlying biological structure.

It is arguable whether the UMAP or
t-SNE visualizations are more useful or aesthetically pleasing. UMAP aims to preserve more global structure but this necessarily reduces resolution within each visual cluster. However, UMAP is unarguably much faster, and for that reason alone, it is increasingly displacing t-SNE as the method of choice for visualizing large scRNA-seq data sets.

Important

When interpreting UMAP and t-SNE plots, it’s safest to focus on local neighborhoods, as these methods prioritizes preserving the relationships between nearest neighbors from high-dimensional space in the 2D embedding. If multiple cell types or clusters appear close together in the plot, they were likely also neighbors in higher-dimensional space. However, distances between non-neighboring cells may not be accurately represented, so using the spacing between clusters to infer similarity between distinct cell types is unreliable.

We’ll use PC space rather than gene expression space to find the UMAP and t-SNE coordinates for efficiency. We specify that the dimensionality reduction (“dimred”) we want to use is the PCs we defined using the elbow plot.

set.seed(100)

# runUMAP 
sce <- runUMAP(sce, dimred="PCA.elbow")

# runTSNE
sce <- runTSNE(sce, dimred="PCA.elbow")

sce

## class: SingleCellExperiment 
## dim: 27058 1245 
## metadata(0):
## assays(2): counts logcounts
## rownames(27058): Gm29155 Pcmtd1 ... ENSMUSG00000095041.8
##   ENSMUSG00000095742.2
## rowData names(1): HVGs
## colnames(1245): AAACCTGAGCTTATCG AAACCTGGTTGAGTTC ... TTTGTCAGTAATAGCA
##   TTTGTCAGTCTCGTTC
## colData names(11): sum detected ... clust.40 clust.20
## reducedDimNames(4): PCA PCA.elbow UMAP TSNE
## mainExpName: NULL
## altExpNames(0):

Plot the clusters colored by clust.10, clust.20 and clust.40 in UMAP space.

plot10 <- plotReducedDim(sce, dimred="UMAP", colour_by="clust.10") 

plot20 <- plotReducedDim(sce, dimred="UMAP", colour_by="clust.20")

plot40 <- plotReducedDim(sce, dimred="UMAP", colour_by="clust.40")

plot.all <- plot10 | plot20 | plot40 

plot.all

ggsave(plot.all, file = paste0(results_dir, "/UMAP_ktest.png"), width = 12, height = 4)

Plot the clusters colored by clust.10, clust.20 and clust.40 in TSNE space.

plot10 <- plotReducedDim(sce, dimred="TSNE", colour_by="clust.10") 

plot20 <- plotReducedDim(sce, dimred="TSNE", colour_by="clust.20")

plot40 <- plotReducedDim(sce, dimred="TSNE", colour_by="clust.40")

plot.all <- plot10 | plot20 | plot40 

plot.all

ggsave(plot.all, file = paste0(results_dir, "/TSNE_ktest.png"), width = 12, height = 4)

Doublet detection

OSCA advaced: Chapter 8

“Doublets” are barcodes that contain the transcriptomes of two cells instead of just one.

We’ll use scDblFinder() from the scDblFinder package to identify doublets for removal. scDblFinder() simulates thousands of doublets by combining the expression profiles of real cells in the data set. It then reports a “doublet score” for each cell that represents how similar the cell is to the simulated doublets vs real cells. Higher scoring cells are more likely to be doublets. The function also classifies the cells based on the scores and other criteria into “doublets” and “singlets”.

sce.dbl <- scDblFinder(sce)

# how many cells are classified as doublets?
table(sce.dbl$scDblFinder.class)

## 
## singlet doublet 
##    1194      51

# plot the doublet score in tsne space
tsne.dbl <- plotTSNE(sce.dbl, colour_by="scDblFinder.score")

# plot the clusters in tsne space for reference
tsne.20 <- plotTSNE(sce.dbl, colour_by="clust.20")

# make a vln plot of the doublet score by cluster
vln.dbl <- 
  plotColData(sce.dbl, 
              y="scDblFinder.score", 
              x = "clust.20",
              colour_by="scDblFinder.class")

# combine the plots with patchwork
plot.all <- (tsne.dbl | tsne.20) / vln.dbl + plot_layout(heights = c(2, 1))

plot.all

ggsave(plot.all, file = paste0(results_dir, "/TSNE_doublets.png"), width = 10, height=6)

The OSCA authors suggest that all cells from a cluster with a large average doublet score should be considered doublets. If no one cluster is over represented by doublets, as is the case here, it is safer to remove individual doublets.

# select the barcodes corresponding to singlets
singlets <- 
  as_tibble(colData(sce.dbl)) %>%
  mutate(barcodes = rownames(colData(sce.dbl))) %>%
  filter(scDblFinder.class == "singlet") %>%
  pull(barcodes)

# filter the sce for singlets
sce <- sce[,singlets]

Important Any time you remove cells from the analysis, it’s important to re-normalize the data, find new PCs and recluster.

#normalize
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)

#feature selection
rowSubset(sce, "HVGs") <- getTopHVGs(sce, n=1000)

# PCA
sce <- runPCA(sce, subset_row=rowSubset(sce, "HVGs"))
percent.var <- attr(reducedDim(sce), "percentVar")
chosen.elbow <- findElbowPoint(percent.var)
reducedDim(sce, "PCA.elbow") <- reducedDim(sce, "PCA")[,1:chosen.elbow]

# cluster
clust.20 <- clusterCells(sce, 
                        use.dimred="PCA.elbow", 
                        BLUSPARAM=NNGraphParam(k=20))
sce$clust.20 <- clust.20

# runUMAP 
sce <- runUMAP(sce, dimred="PCA.elbow")

# runTSNE
sce <- runTSNE(sce, dimred="PCA.elbow")

Identify cluster enriched genes

OSCA basic: Chapter 6

To interpret our clustering results, we identify the genes that drive separation between clusters. These cluster enriched genes allow us to assign biological meaning to each cluster based on their functional annotation. We perform differential expression analysis between each pair of clusters clusters the using scoreMarkers() function from scran. For each pairwise comparison three different effect sizes are reported to describe the size of the gene expression difference:

Cohen’s \(d\) (logFC.cohen) is a standardized log-fold change where the difference in the mean log-expression between groups is scaled by the average standard deviation across groups. In other words, it is the number of standard deviations that separate the means of the two groups. The interpretation is similar to the log-fold change; positive values indicate that the gene is upregulated in our cluster of interest, negative values indicate downregulation and values close to zero indicate that there is little difference.
Area under the curve (AUC) represents the probability that a randomly chosen observation from our cluster of interest is greater than a randomly chosen observation from the other cluster. A value of 1 corresponds to upregulation, where all values of our cluster of interest are greater than any value from the other cluster; a value of 0.5 means that there is no net difference in the location of the distributions; and a value of 0 corresponds to downregulation.
The log-fold change in the proportion of cells with detected expression between clusters (logFC.detected). This ignores any information about the magnitude of expression, only considering whether any expression is detected at all. Again, positive values indicate that a greater proportion of cells express the gene in our cluster of interest compared to the other cluster.

If you ave \(N\) clusters, then you’ll have \(N-1\) pairwise comparisons for each gene for each cluster. scoreMarkers() summarizes the effect sizes of the pairwise comparisons for each cluster with various summary statistics (mean, median, min, max), that may be useful in different scenarios. You can read more about them here.

# use the k=20 clusters as an example
markers20 <- scoreMarkers(sce, sce$clust.20)

The scores for all pairwise comparisons involving a particular cluster are consolidated into a single DataFrame for that cluster. The scoreMarkers() returns a list of DataFrames where each DataFrame corresponds to a cluster and each row of the DataFrame corresponds to a gene. In the DataFrame for cluster \(X\), the columns contain the self.average, the mean log-expression in \(X\); other.average, the grand mean across all other clusters; self.detected, the proportion of cells with detected expression in \(X\); other.detected, the mean detected proportion across all other clusters; and finally, the effect size summaries generated from all pairwise comparisons involving \(X\). .

# markers20 is a list object. Each element of the list is a data frame with the summary 
markers20

## List of length 12
## names(12): 1 2 3 4 5 6 7 8 9 10 11 12

colnames(markers20[["1"]]) # statistics for cluster 1.

##  [1] "self.average"          "other.average"         "self.detected"        
##  [4] "other.detected"        "mean.logFC.cohen"      "min.logFC.cohen"      
##  [7] "median.logFC.cohen"    "max.logFC.cohen"       "rank.logFC.cohen"     
## [10] "mean.AUC"              "min.AUC"               "median.AUC"           
## [13] "max.AUC"               "rank.AUC"              "mean.logFC.detected"  
## [16] "min.logFC.detected"    "median.logFC.detected" "max.logFC.detected"   
## [19] "rank.logFC.detected"

I’ve written a function, saveTopGenesDF(), that pulls the top n genes for each cluster given a chosen statistic into a data frame. If write_df = TRUE it also writes a .csv file to resuts_dir.

# `saveTopGeneDF()` is a function to save the top cluster enriched genes for each cluster
# marker_list is the output of scoreMarkers
# topn is the number of genes you want
# stat_oi is the  stat you want to sort by
# if write_df=TRUE the df is saved to a csv file
# in results_dir
saveTopGenesDF = function(marker_list, 
                          topn, 
                          stat_oi,
                          write_df=FALSE){
  
top_df <- {}

for(i in 1:length(marker_list)){
  
 if(grepl("rank", stat_oi)){ # if the stat is a rank take the min
   
   top_clust <- 
    as_tibble(marker_list[[i]]) %>% # select element i from the list
    mutate(Gene = rownames(marker_list[[i]]), # add a column with the genes
           Cluster = paste0("cluster", i)) %>% # add a column with the cluster
    slice_min(order_by = !!sym(stat_oi), # order by stat oi and take the min n 
              n = topn) %>%
    select(Cluster, # select relevant columns
           Gene,
           self.average,
           other.average,
           self.detected,
           other.detected,
           !!stat_oi)
 
  top_df <- rbind(top_df, top_clust) #bind into one df
 
  } else { # if the stat is not a rank take the max
 
   top_clust <- 
    as_tibble(marker_list[[i]]) %>% # select element i from the list
    mutate(Gene = rownames(marker_list[[i]]), # add a column with the genes
           Cluster = paste0("cluster", i)) %>% # add a column with the cluster
    slice_max(order_by = !!sym(stat_oi), # order by stat oi and take the top n 
              n = topn) %>%
    select(Cluster, # select relevant columns
           Gene,
           self.average,
           other.average,
           self.detected,
           other.detected,
           !!stat_oi)
 
 top_df <- rbind(top_df, top_clust) #bind into one df
     
 }
 
}

if(write_df==TRUE){
  write.csv(top_df, file = paste0(results_dir, "/marker_genes_top", topn, "_", stat_oi, ".csv"))
    }
return(top_df)
}

mean.logFC.cohen is a reasonable default for most applications. Here, we’ll pull the top 3 genes by mean.logFC.cohen for each cluster.

top_mean <- saveTopGenesDF(markers20, 3, "mean.logFC.cohen")

# show the first three clusters
as.data.frame(top_mean) %>% filter(Cluster %in% c("cluster1", "cluster2", "cluster3"))

##    Cluster          Gene self.average other.average self.detected
## 1 cluster1          H1f7     5.416820     2.6026708             1
## 2 cluster1         Irgc1     6.092426     4.0904163             1
## 3 cluster1         Hmgb4     6.546284     4.1063604             1
## 4 cluster2 1700001P01Rik     4.584547     1.5035519             1
## 5 cluster2 1700029E06Rik     3.131958     0.5249133             1
## 6 cluster2         Tex36     4.744866     2.1979527             1
## 7 cluster3      Mphosph8     3.066313     0.4437957             1
## 8 cluster3        Mllt10     3.760494     0.6540869             1
## 9 cluster3 D130017N08Rik     2.991809     0.3582564             1
##   other.detected mean.logFC.cohen
## 1      0.7817510         3.688171
## 2      0.9653472         3.366224
## 3      0.9714688         3.353837
## 4      0.5772189         5.192025
## 5      0.2496723         5.031906
## 6      0.7120591         4.458102
## 7      0.2668179         4.950068
## 8      0.3646330         4.849644
## 9      0.2272789         4.187093

min.logFC.cohen is more strict, and can be useful to find genes that are specifically enriched in one cluster.

top_min <- saveTopGenesDF(markers20, 3, "min.logFC.cohen")

# show the first three clusters
as.data.frame(top_min) %>% filter(Cluster %in% c("cluster1", "cluster2", "cluster3"))

##    Cluster          Gene self.average other.average self.detected
## 1 cluster1 1700017M07Rik     1.741367    0.09938439     0.6838710
## 2 cluster1        Csrnp1     2.349789    0.40863359     0.7935484
## 3 cluster1       Tmem144     3.108212    0.46517059     0.8709677
## 4 cluster2 1700029E06Rik     3.131958    0.52491328     1.0000000
## 5 cluster2 4930596D02Rik     3.162792    0.75857982     0.9898990
## 6 cluster2         Tex36     4.744866    2.19795270     1.0000000
## 7 cluster3          Cks2     3.501081    0.79304798     1.0000000
## 8 cluster3        Lypd10     2.190546    0.23745206     0.9829060
## 9 cluster3       Gm21269     2.099048    0.16499472     0.9743590
##   other.detected min.logFC.cohen
## 1     0.05876498       1.2316642
## 2     0.28922950       1.0033063
## 3     0.22656074       0.9774013
## 4     0.24967229       1.3122755
## 5     0.37159787       1.3069628
## 6     0.71205906       1.1382828
## 7     0.46259710       1.8619876
## 8     0.17156476       1.6595942
## 9     0.12063799       1.6513117

Plot the top genes from “mean.logFC.cohen” and “min.logFC.cohen” for cluster 1.

goi_mean <- 
  top_mean %>%
  filter(Cluster == "cluster1") %>%
  slice_max(order_by = mean.logFC.cohen, 
            n =3) %>%
  pull(Gene)

goi_mean

## [1] "H1f7"  "Irgc1" "Hmgb4"

goi_min <- 
  top_min %>%
  filter(Cluster == "cluster1") %>%
  slice_max(order_by = min.logFC.cohen, 
            n =3) %>%
  pull(Gene)

goi_min

## [1] "1700017M07Rik" "Csrnp1"        "Tmem144"

plot.top.mean <- 
  plotExpression(sce,
                 features=goi_mean,
                 x="clust.20", colour_by="clust.20") +
  ggtitle("top mean.logFC.cohen")

plot.top.min <- 
  plotExpression(sce,
                 features=goi_min,
                 x="clust.20", colour_by="clust.20") +
  ggtitle("top min.logFC.cohen")

plot.both <- plot.top.mean / plot.top.min

plot.both

ggsave(plot.both, 
       file = paste0(results_dir, "/cluster1_top_genes.png"),
       height = 12, width = 10)

Assigning cell labels from gene sets

OSCA basic: Chapter 7

Assign lables to single cells

One common way to assign cell type labels to single cells is using pre-defined cell-type marker genes. Specifically, we use the AUCell package to identify marker sets that are highly expressed in each cell. The marker genes should be formatted as a list where each element of the list is a character vector of genes for each cell type. I’ll make this list from the marker genes identified by Lukassen et al. in SuppTableGenes.xlsx.

SuppTableGenes.xlsx is an example of a “messy” data frame. It’s often easier to deal with data in a “tidy” format.

In Tidy data:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

We’ll use functions from the dplyr package in the Tidyverse to reformat SuppTableGenes.xlsx into a table with the following columns:

GeneName | CellType | Log2FC | Pval |

Where each row represents the log fold change and p-value of a gene in a specific cell type versus all other cell types. Then, we’ll filter for positive fold changes, select the 100 genes with the smalest p-value in each cluster, and turn the data frame into a list of data frames.

# read in the cluster enriched genes from Lukassen
sup_tab <- read_excel(paste0(data_dir, "/Lukassen_testes/SuppTableGenes.xlsx"))
glimpse(sup_tab)

## Rows: 27,790
## Columns: 29
## $ EnsemblID                         <chr> "ENSMUSG00000079701", "ENSMUSG000000…
## $ GeneName                          <chr> "Ssxb3", "Ssxb1", "Ssxb2", "Gm12695"…
## $ `RS1 Average`                     <dbl> 0.6408870, 1.9710779, 1.8309237, 1.1…
## $ `RS1 Log2 Fold Change vs all`     <dbl> 5.018199, 4.160671, 4.043102, 3.8635…
## $ `RS1 P-Value`                     <dbl> 6.699313e-48, 4.186727e-45, 6.198828…
## $ `SC2 Average`                     <dbl> 0.038339822, 0.101548718, 0.09533145…
## $ `SC2 Log2 Fold Change vs all`     <dbl> -2.601868, -2.955995, -2.957196, -2.…
## $ `SC2 P-Value`                     <dbl> 8.069916e-11, 4.570008e-16, 4.901922…
## $ `CS Average`                      <dbl> 0.002912556, 0.029125559, 0.03495067…
## $ `CS Log2 Fold Change vs all`      <dbl> -5.092232, -4.342928, -4.010580, -4.…
## $ `CS P-Value`                      <dbl> 4.391694e-08, 2.714251e-09, 1.488424…
## $ `RS2 Average`                     <dbl> 0.009164249, 0.241936166, 0.26942891…
## $ `RS2 Log2 Fold Change vs all`     <dbl> -4.27290084, -1.42161408, -1.1571368…
## $ `RS2 P-Value`                     <dbl> 1.207434e-10, 3.110400e-03, 1.837556…
## $ `ES Average`                      <dbl> 0.003711412, 0.022268474, 0.00371141…
## $ `ES Log2 Fold Change vs all`      <dbl> -4.705656, -4.611644, -6.333842, -3.…
## $ `ES P-Value`                      <dbl> 7.676492e-06, 1.277472e-07, 5.895288…
## $ `SC1 Average`                     <dbl> 0.005288957, 0.037022699, 0.02644478…
## $ `SC1 Log2 Fold Change vs all`     <dbl> -4.156645, -3.869184, -4.196458, -4.…
## $ `SC1 P-Value`                     <dbl> 8.254880e-04, 2.035123e-04, 8.704576…
## $ `Sertoli Average`                 <dbl> 0.0000000, 0.4383190, 0.0000000, 0.0…
## $ `Sertoli Log2 Fold Change vs all` <dbl> 0.023696955, -0.105260671, -1.602710…
## $ `Sertoli P-Value`                 <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0…
## $ `Spg Average`                     <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0…
## $ `Spg Log2 Fold Change vs all`     <dbl> 0.3243484, -1.3911729, -1.3020588, -…
## $ `Spg P-Value`                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ `Leydig Average`                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `Leydig Log2 Fold Change vs all`  <dbl> 0.6014802, -1.1140411, -1.0249270, -…
## $ `Leydig P-Value`                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

# list the 100 genes with the smallest p-val
# for each cell type
top_lukassen_df <- 
  sup_tab %>%
  # Select columns that include "GeneName", "P-Value", and "Log2" (fold change columns)
  select(GeneName, contains("P-Value"), contains("Log2")) %>% 
  # Remove duplicate rows based on the "GeneName" column
  filter(!duplicated(GeneName)) %>%
  # Convert the data frame from wide to long format
  pivot_longer(cols = -GeneName,   # Keep "GeneName" as is, and reshape other columns
               names_to = "CellType", # Create a new column "CellType" from the names of reshaped columns
               values_to = "val") %>% # Store the values in a new column "val"
  # Create a new column "Stat" that categorizes values as either "Pval" or "Log2FC"
  mutate(Stat = case_when(
    grepl("P-Value", CellType) ~ "Pval",  # If "P-Value" is in the "CellType" column, set "Stat" as "Pval"
    grepl("Log2", CellType) ~ "Log2FC"))  %>% # If "Log2" is in the "CellType" column, set "Stat" as "Log2FC"
  # Clean up the "CellType" names by removing unnecessary strings
  mutate(CellType = gsub(" P-Value", "", CellType), 
         CellType = gsub(" Log2 Fold Change vs all", "", CellType)) %>% 
  # Convert the data back from long to wide format
  pivot_wider(names_from = Stat,  # Use "Stat" column (which contains "Pval" or "Log2FC") as the new column names
              values_from = val) %>% # Use values from "val" column as the data for the new columns
  # Filter the data to keep only rows where "Log2FC" is positive
  filter(Log2FC > 0) %>%
  # Group the data by "CellType"
  group_by(CellType) %>% 
  # Select the top 100 rows with the smallest "Pval" within each "CellType" group
  slice_min(Pval, n = 100) %>% 
  # Remove the grouping structure from the data frame
  ungroup() %>%
  # Filter out any remaining duplicate "GeneName" entries, keeping only one occurrence
  filter(!duplicated(GeneName))
  
top_lukassen_list <- 
  top_lukassen_df %>%
  # Group the data by the "CellType" column
  group_by(CellType) %>%
  # Split the grouped data frame into a list of data frames
  # Each element in the list corresponds to a unique "CellType"
  split(f = as.factor(.$CellType))

# pull just the genes to use with AUCell 
for_aucell <- lapply(top_lukassen_list, function(x) x %>% pull(GeneName))

Now, we’ll use AUCell_buildRankings() to rank genes by their expression value in each cell. Then AUCell_calcAUC will quantify the enrichment of the top Lukassen markers in each cell. Higher scores correspond to higher enrichment.

rankings <- AUCell_buildRankings(as.matrix(counts(sce)), 
                                 plotStats=FALSE,
                                 verbose=FALSE)

cell.aucs <- AUCell_calcAUC(for_aucell, rankings)

Next, we’ll check how well the marker genes discriminate cell types. In heterogeneous populations, the distribution for each label should be bimodal with one high-scoring peak containing cells of that cell type and a low-scoring peak containing cells of other types. In populations where a particular cell type is expected, lack of clear bimodality for the corresponding label may indicate that its gene set is not sufficiently informative.

par(mfrow=c(3,3))
AUCell_exploreThresholds(cell.aucs, plotHist=TRUE, assign=TRUE)

# save the plot
pdf(file = paste0(results_dir, "/aucell_check.pdf"))
par(mfrow=c(3,3))
AUCell_exploreThresholds(cell.aucs, plotHist=TRUE, assign=TRUE) 
dev.off()

Based on the examples from the AUCell vignette, these look as good as random gene sets. But let’s see what the results look like.

Assign the label with the highest score to each cell.

results <- t(assay(cell.aucs))
# Take the marker set with the top AUC as the label for that cell.
aucell.labels <- colnames(results)[max.col(results)]

# add the labels to the sce
colData(sce) <- cbind(colData(sce), aucell.labels)

# plot the labels on the tnse plot
tsne.lab <- plotTSNE(sce, colour_by="aucell.labels")

# plot the clusters in tsne space for reference
tsne.20 <- plotTSNE(sce, colour_by="clust.20")

# combine the plots with patchwork
plot.all <- tsne.lab | tsne.20
plot.all

# Save the tsne plot as a PNG file in the specified results directory
ggsave(plot.all, file = paste0(results_dir, "/TSNE_aucell_labels.png"), width = 8, height=4)

These actually look pretty reasonable!

Assign lables to clusters

We can also assign cell-type labels to clusters of cells instead of individual cells. AUCell works best with large numbers of markers, but you might not always have access to that information. This is an example of using a short list of cell-type markers to label cell clusters instead of individual cells. We’ll make a heatmap that shows the mean.logFC.cohen of the marker genes from figure 2 in each cluster. The rows represent marker genes. The columns are clusters. The genes are labeled by the cells they are expressed in according to supplemental table 3. If a cluster has a high mean.logFC.cohen for a specific set of cell-type marker genes, the cluster may represent that cell type.

# This is a df I made using info from supplemental table 3
fig2 <- read.delim(paste0(data_dir, "/Lukassen_testes/fig2_genes.txt"))

# Set row names to the values in the 'Gene.name' column
rownames(fig2) <- fig2$Gene.name 

# Remove the 'Gene.name' column from the data frame since it's now used as row names
fig2$Gene.name <- NULL

# Replace all occurrences of 0 with "NO" in the data frame
fig2[fig2 == 0] <- "NO"

# Replace all occurrences of 1 with "YES" in the data frame
fig2[fig2 == 1] <- "YES"

# Define color mappings for each cell type based on the values "YES" (black) and "NO" (white)
ann_colors <- list(
  Early.Sgonia = c(YES="black", NO="white"),
  Late.Sgonia = c(YES="black", NO="white"),
  Early.Scytes = c(YES="black", NO="white"),
  Late.Scytes = c(YES="black", NO="white"),
  Round.Stids = c(YES="black", NO="white"),
  Later.Stids = c(YES="black", NO="white"),
  Sertoli = c(YES="black", NO="white"),
  Leydig = c(YES="black", NO="white")
)

# This retrieves the "mean.logFC.cohen" of
# every gene for each  of our clusters
all <- saveTopGenesDF(markers20, nrow(markers20[[1]]), "mean.logFC.cohen")

# Filter the 'all' data frame to keep only the marker genes of interest that are present in 'fig2'
goi <- all %>% filter(Gene %in% rownames(fig2))

# Reshape the data frame from long to wide format
# This spreads the "mean.logFC.cohen" values across clusters for each gene
goi_wide <- 
  goi %>%
  pivot_wider(id_cols = Gene,
              names_from = Cluster,
              values_from = mean.logFC.cohen)

# Convert the wide-format tibbl to a regular data frame
goi_wide <- as.data.frame(goi_wide)

# Set the row names of the data frame to the gene names
rownames(goi_wide) <- goi_wide$Gene

# Remove the 'Gene' column since it's now used as row names
goi_wide$Gene <- NULL

# Define a color palette for the heatmap using a reversed "RdBu" color scheme
color_pal <- colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(20)

# Define breaks for the color scaling in the heatmap
# The breaks are calculated to have less emphasis on the values close to zero
myBreaks <- c(seq(min(goi_wide), 0, length.out=ceiling(20/2) + 1), 
              seq(max(goi_wide)/20, max(goi_wide),
                  length.out=floor(20/2)))

# Generate the heatmap with specified settings
pheatmap(goi_wide,
         color = color_pal,
         border_color = NA,                 # No border color around heatmap cells
         cluster_cols = TRUE,               # Cluster columns (i.e., gene clusters)
         cluster_rows = TRUE,               # Cluster rows (i.e., sample clusters)
         show_rownames = TRUE,              # Show row names (gene names)
         annotation_row = fig2,             # Add annotations from the fig2 data frame
         annotation_colors = ann_colors,    # Apply the custom color mapping to the annotations
         breaks=myBreaks)                   # Use the custom breaks for color scaling

# Create a PDF file to save the heatmap
pdf(file = paste0(results_dir, "/fig2_heatmap.pdf"))
pheatmap(goi_wide,
         color = color_pal,
         border_color = NA,                 
         cluster_cols = TRUE,               
         cluster_rows = TRUE,               
         show_rownames = TRUE,              
         annotation_row = fig2,             
         annotation_colors = ann_colors,    
         breaks=myBreaks) 
# Close the PDF device, saving the heatmap
dev.off()

## pdf 
##   3

Based on the heatmap, these could be reasonable labels for the clusters:

Cluster	CellType
Cluster1	Round Stids
Cluster2	Sgonia
Cluster3	Scytes
Cluster4	Scytes
Cluster5	Scytes
Cluster6	Sgonia
Cluster7	Sgonia
Cluster8	Later Stids
Cluster9	Round Stids
Cluster10	Later Stids
Cluster11	Late Scytes
Cluster12	Sgonia

Add them to the sce and plot.

# make a vector of the cluster labels in order of the cluster number
type_vec <-  c("Round Stids", "Sgonia", "Scytes",
             "Scytes", "Scytes", "Sgonia",
             "Sgonia", "Later Stids", "Round Stids", 
             "Later Stids", "Late Scytes", "Sgonia")

# map the cell type lable of every cell to the cluster number 
# of every cell
label_vec <- type_vec[colData(sce)$clust.20]

# add the cell type lables to coData
colData(sce)$fig2.labels <- label_vec

# plot
tsne.lab <- plotTSNE(sce, colour_by="fig2.labels")
ggsave(file = paste0(results_dir, "/TSNE_fig2.labels.pdf"),
       width=8, height=4)

tsne.lab | tsne.20

Mean expression tables

Use aggregateAcrossCells() from the scuttle package to get the mean expression of genes within groups of cells. We’ll find the mean by “fig2.labels” as an example.

mean_sce <- summarizeAssayByGroup(sce, 
                                 ids = sce$fig2.labels,
                                 statistics = "mean",
                                 assay.type = "logcounts")

# pull the mean assay into a data frame
mean_df <- assay(mean_sce, "mean")

head(mean_df)

##         Late Scytes Later Stids Round Stids       Scytes     Sgonia
## Gm29155  0.00000000  0.01798324  0.02397328 0.0007110242 0.00000000
## Pcmtd1   0.01489881  0.00000000  0.00000000 0.0024451762 0.00000000
## Gm26901  0.09557378  0.04411660  0.02758087 0.0865191129 0.08797026
## Cdh7     0.00000000  0.00000000  0.00000000 0.0000000000 0.00000000
## Exo1     0.10075434  0.00000000  0.03333784 0.1070970622 0.04743240
## Becn2    0.00000000  0.00000000  0.00000000 0.0000000000 0.00000000

# write it to a csv file
write.csv(mean_df, file = paste0(results_dir, "/mean_logcounts_by_fig2.labels.csv"))

Start with deposited counts and supplied metadata

If authors of a study have deposited the count matrix they used in GEO, it’s often more efficient to use that than to re-process the raw sequencing files.

Using deposited data will also allow you to more accurately reproduce the original study. Often, only the cells used in the study after filtering for empty drops, outliers, and low quality cells are included in the deposited data, so there’s no need to filter yourself.

If the authors also include a meta data file that lists a cell-type label for each cell, you will not have to repeat the clustering and annotation steps.

Files deposited by the authors

/data/Lukassen_testes/GSE104556_matrix.mtx.gz: From GEO (GSE104556). The gene x barcode UMI count matrix.
/data/Lukassen_testes/GSE104556_genes.tsv.gz: From GEO (GSE104556). The genes that correspond to the rows of the UMI count matrix.
/data/Lukassen_testes/GSE104556_barcodes.tsv.gz: From GEO (GSE104556). The barcodes that correspond to the columns of the UMI count matrix.
/data/Lukassen_testes/SuppTableCells.xlsx: From figshare. Cell meta data including the cell-type labels.
/data/Lukassen_testes/SuppTableGenes.xlsx: From figshare. Gene meta data including gene names.

Load the count matrix.

# Read the count matrix data from a Matrix Market file (.mtx.gz) and store it in the 'counts' variable
counts <- readMM(paste0(data_dir, "/Lukassen_testes/GSE104556_matrix.mtx.gz"))

# Display the dimensions of the 'counts' matrix (number of genes x number of cells)
dim(counts)

## [1] 51868  2552

# Read the barcodes file (which identifies individual cells) from a tab-delimited file and store it in 'barcodes'
barcodes <- read.delim(paste0(data_dir, "/Lukassen_testes/GSE104556_barcodes.tsv.gz"), header = FALSE)

# Display the dimensions of the 'barcodes' data (number of cells)
dim(barcodes)

## [1] 2552    1

# Display the first few rows of the 'barcodes' data
head(barcodes)

##                   V1
## 1 AAACCTGAGCTTATCG-1
## 2 AAACCTGGTTGAGTTC-1
## 3 AAACCTGTCAACGAAA-1
## 4 AAACGGGCACAGGTTT-1
## 5 AAACGGGTCATTTGGG-1
## 6 AAACGGGTCCTCATTA-1

# Display the last few rows of the 'barcodes' data
tail(barcodes)

##                      V1
## 2547 TTTGGTTGTGCAACTT-2
## 2548 TTTGGTTTCTACTATC-2
## 2549 TTTGTCACATCGGTTA-2
## 2550 TTTGTCAGTCCAGTTA-2
## 2551 TTTGTCATCTGTCCGT-2
## 2552 TTTGTCATCTTTCCTC-2

# The barcodes have "-1" or "-2" appended to them, corresponding to cells derived from mouse 1 or mouse 2 in the study
# Assign the barcodes as the column names of the 'counts' matrix
colnames(counts) <- barcodes$V1

# Read the gene information from a tab-delimited file and store it in 'genes'
genes <- read.delim(paste0(data_dir, "/Lukassen_testes/GSE104556_genes.tsv.gz"), header = FALSE)

# Display the dimensions of the 'genes' data (number of genes)
dim(genes)

## [1] 51868     2

# Display the first few rows of the 'genes' data
head(genes)

##                   V1            V2
## 1 ENSMUSG00000102693 4933401J01Rik
## 2 ENSMUSG00000064842       Gm26206
## 3 ENSMUSG00000051951          Xkr4
## 4 ENSMUSG00000102851       Gm18956
## 5 ENSMUSG00000103377       Gm37180
## 6 ENSMUSG00000104017       Gm37363

# Assign the gene names as the row names of the 'counts' matrix
rownames(counts) <- genes$V1

# Display the first few rows of the first 5 columns of the 'counts' matrix
head(counts[, 1:5])

## 6 x 5 sparse Matrix of class "dgTMatrix"
##                    AAACCTGAGCTTATCG-1 AAACCTGGTTGAGTTC-1 AAACCTGTCAACGAAA-1
## ENSMUSG00000102693                  .                  .                  .
## ENSMUSG00000064842                  .                  .                  .
## ENSMUSG00000051951                  .                  .                  .
## ENSMUSG00000102851                  .                  .                  .
## ENSMUSG00000103377                  .                  .                  .
## ENSMUSG00000104017                  .                  .                  .
##                    AAACGGGCACAGGTTT-1 AAACGGGTCATTTGGG-1
## ENSMUSG00000102693                  .                  .
## ENSMUSG00000064842                  .                  .
## ENSMUSG00000051951                  .                  .
## ENSMUSG00000102851                  .                  .
## ENSMUSG00000103377                  .                  .
## ENSMUSG00000104017                  .                  .

# get the number of cells each gene is expressed in
rs <- rowSums(counts!=0)

# remove genes expressed 0 cells
filt_mat <- counts[rs>0,]

Load the cell and gene meta data.

# Load gene data from an Excel file into a data frame
sup_genes <- read_excel(paste0(data_dir, "/Lukassen_testes/SuppTableGenes.xlsx"))

# Quickly view the structure and content of the 'sup_genes' data frame
glimpse(sup_genes)

## Rows: 27,790
## Columns: 29
## $ EnsemblID                         <chr> "ENSMUSG00000079701", "ENSMUSG000000…
## $ GeneName                          <chr> "Ssxb3", "Ssxb1", "Ssxb2", "Gm12695"…
## $ `RS1 Average`                     <dbl> 0.6408870, 1.9710779, 1.8309237, 1.1…
## $ `RS1 Log2 Fold Change vs all`     <dbl> 5.018199, 4.160671, 4.043102, 3.8635…
## $ `RS1 P-Value`                     <dbl> 6.699313e-48, 4.186727e-45, 6.198828…
## $ `SC2 Average`                     <dbl> 0.038339822, 0.101548718, 0.09533145…
## $ `SC2 Log2 Fold Change vs all`     <dbl> -2.601868, -2.955995, -2.957196, -2.…
## $ `SC2 P-Value`                     <dbl> 8.069916e-11, 4.570008e-16, 4.901922…
## $ `CS Average`                      <dbl> 0.002912556, 0.029125559, 0.03495067…
## $ `CS Log2 Fold Change vs all`      <dbl> -5.092232, -4.342928, -4.010580, -4.…
## $ `CS P-Value`                      <dbl> 4.391694e-08, 2.714251e-09, 1.488424…
## $ `RS2 Average`                     <dbl> 0.009164249, 0.241936166, 0.26942891…
## $ `RS2 Log2 Fold Change vs all`     <dbl> -4.27290084, -1.42161408, -1.1571368…
## $ `RS2 P-Value`                     <dbl> 1.207434e-10, 3.110400e-03, 1.837556…
## $ `ES Average`                      <dbl> 0.003711412, 0.022268474, 0.00371141…
## $ `ES Log2 Fold Change vs all`      <dbl> -4.705656, -4.611644, -6.333842, -3.…
## $ `ES P-Value`                      <dbl> 7.676492e-06, 1.277472e-07, 5.895288…
## $ `SC1 Average`                     <dbl> 0.005288957, 0.037022699, 0.02644478…
## $ `SC1 Log2 Fold Change vs all`     <dbl> -4.156645, -3.869184, -4.196458, -4.…
## $ `SC1 P-Value`                     <dbl> 8.254880e-04, 2.035123e-04, 8.704576…
## $ `Sertoli Average`                 <dbl> 0.0000000, 0.4383190, 0.0000000, 0.0…
## $ `Sertoli Log2 Fold Change vs all` <dbl> 0.023696955, -0.105260671, -1.602710…
## $ `Sertoli P-Value`                 <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0…
## $ `Spg Average`                     <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0…
## $ `Spg Log2 Fold Change vs all`     <dbl> 0.3243484, -1.3911729, -1.3020588, -…
## $ `Spg P-Value`                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ `Leydig Average`                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `Leydig Log2 Fold Change vs all`  <dbl> 0.6014802, -1.1140411, -1.0249270, -…
## $ `Leydig P-Value`                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

# Convert the 'sup_genes' tibble (from read_excel) into a standard data frame
sup_genes <- as.data.frame(sup_genes)

# Set the Ensembl ID column as the row names of the 'sup_genes' data frame
rownames(sup_genes) <- sup_genes$EnsemblID

# Remove the 'EnsemblID' column now that it is used as row names
sup_genes$EnsemblID <- NULL

# Load cell data from another Excel file into a data frame
sup_cells <- read_excel(paste0(data_dir, "/Lukassen_testes/SuppTableCells.xlsx"))

# Quickly view the structure and content of the 'sup_cells' data frame
glimpse(sup_cells)

## Rows: 2,552
## Columns: 13
## $ Barcode                 <chr> "AAACCTGAGCTTATCG-1", "AAACCTGGTTGAGTTC-1", "A…
## $ Replicate               <chr> "mouse 1", "mouse 1", "mouse 1", "mouse 1", "m…
## $ `TSNE-1`                <dbl> 1.138438, 6.849414, -23.578904, -26.702202, -4…
## $ `TSNE-2`                <dbl> -32.931800, -34.013543, -13.438399, -11.977982…
## $ UMICount                <dbl> 23114, 19002, 22849, 27921, 18466, 15074, 1651…
## $ GeneCount               <dbl> 4977, 4694, 4908, 5392, 4718, 3714, 4425, 7810…
## $ ProportionMitochondrial <dbl> 3.461555e-04, 4.210748e-04, 4.377517e-05, 1.43…
## $ Cluster                 <dbl> 4, 1, 4, 4, 5, 5, 1, 2, 4, 2, 2, 5, 3, 2, 2, 2…
## $ CellType                <chr> "RS2", "RS1", "RS2", "RS2", "ES", "ES", "RS1",…
## $ MonoclePseudotime       <dbl> 28.0360268, 26.6232265, 33.4668827, 34.5019604…
## $ MonocleRank             <dbl> 1502, 1409, 1777, 1805, 1881, 1931, 1045, 257,…
## $ ScratPseudotime         <dbl> 0.019297217, 0.021289324, 0.014213382, 0.01407…
## $ ScratRank               <dbl> 1474, 1393, 1767, 1782, 1899, 1951, 1048, 511,…

# Convert the 'sup_cells' tibble (from read_excel) into a standard data frame
sup_cells <- as.data.frame(sup_cells)

# Set the Barcode column as the row names of the 'sup_cells' data frame
rownames(sup_cells) <- sup_cells$Barcode

# Remove the 'Barcode' column now that it is used as row names
sup_cells$Barcode <- NULL

Create a SingleCellExperiment object using the filtered matrix, with the counts stored in the ‘counts’ assay. Add sup_cells to the colData slot. Add gene names to the rowData slot.

sce <- SingleCellExperiment(assays = list(counts = filt_mat), colData = sup_cells)

Add the gene meta data to the sce object.

# Reorder the rows of 'sup_genes' to match the rownames of the 'sce' object
# This ensures that the gene names and associated data in 'sup_genes' are aligned with the data in 'sce'
sup_genes <- sup_genes[match(rownames(sce), rownames(sup_genes)),]

# Add a new column to the 'rowData' of the 'sce' object to store the gene names from 'sup_genes'
rowData(sce)$GeneNames <- sup_genes$GeneName

If you want, you can switch the rownames(sce) from ensembl IDs to gene names so that other functions will operate on the gene names instead of ensembl IDs.

# Add Ensembl IDs to 'rowData' to preserve them, storing them under a new column 'GeneIDs'
rowData(sce)$GeneIDs <- rownames(sce)

# Replace the row names of 'sce' with the gene names stored in the 'GeneNames' column of 'rowData'
# This makes gene names (rather than Ensembl IDs) the primary row identifiers in 'sce'
rownames(sce) <- rowData(sce)$GeneNames

# Display the 'sce' object to confirm the changes
sce

## class: SingleCellExperiment 
## dim: 27790 2552 
## metadata(0):
## assays(1): counts
## rownames(27790): 4933401J01Rik Gm38385 ... ZP3AR_Satellite_Muridae
##   ZP3AR_MM_Satellite_Mus
## rowData names(2): GeneNames GeneIDs
## colnames(2552): AAACCTGAGCTTATCG-1 AAACCTGGTTGAGTTC-1 ...
##   TTTGTCATCTGTCCGT-2 TTTGTCATCTTTCCTC-2
## colData names(12): Replicate TSNE-1 ... ScratPseudotime ScratRank
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

At this point, you can perform the following analyses from above:

#normalize
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)

#feature selection
rowSubset(sce, "HVGs") <- getTopHVGs(sce, n=1000)

# PCA
sce <- runPCA(sce, subset_row=rowSubset(sce, "HVGs"))
percent.var <- attr(reducedDim(sce), "percentVar")
chosen.elbow <- findElbowPoint(percent.var)
reducedDim(sce, "PCA.elbow") <- reducedDim(sce, "PCA")[,1:chosen.elbow]

# runTSNE
sce <- runTSNE(sce, dimred="PCA.elbow")

Make a TSNE plot with the cell types from the paper.

tsne.CellType <- plotTSNE(sce,
                          colour_by="CellType")

tsne.CellType

# Save the tsne plot as a PNG file in the specified results directory
ggsave(tsne.CellType, file = paste0(results_dir, "/TSNE_Lukassen_CellTypes.png"), width = 4, height=4)

In this case, the colData already includes the first two TSNE coordinates identified by the authors, so there isn’t really a need to do anything other than Normalization and Transformation in order to repeat their analyses. But including this information is uncommon.

Plot the author’s TSNE coordinates.

# Extract the t-SNE coordinates (TSNE-1 and TSNE-2 columns) from the 'sup_cells' data frame
paper_tsne <- sup_cells[, c("TSNE-1", "TSNE-2")]

# Assign the extracted t-SNE coordinates to the 'sce' SingleCellExperiment object
# This adds 'paper_tsne' as a reduced dimension slot in 'sce' named "paper_tsne"
reducedDim(sce, "paper_tsne") <- paper_tsne

paper_tsne.CellType <- plotReducedDim(sce,
                                      colour_by ="CellType",
                                      dimred = "paper_tsne")

paper_tsne.CellType

# Save the tsne plot as a PNG file in the specified results directory
ggsave(paper_tsne.CellType, file = paste0(results_dir, "/Lukassen_TSNE_Lukassen_CellTypes.png"), width = 4, height=4)

Get the average expression of each gene in each cluster.

mean_sce <- summarizeAssayByGroup(sce, 
                                 ids = sce$CellType,
                                 statistics = "mean",
                                 assay.type = "logcounts")

# pull the mean assay into a data frame
mean_df <- assay(mean_sce, "mean")

head(mean_df)

##                        CS         ES    Leydig        RS1        RS2        SC1
## 4933401J01Rik 0.000000000 0.01570434 0.0000000 0.01466208 0.01895485 0.00000000
## Gm38385       0.000000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
## Rp1           0.026417045 0.09468464 0.0000000 0.12953817 0.10351288 0.34370331
## Sox17         0.023717340 0.04022804 0.0000000 0.29172019 0.07635653 0.01826828
## Gm37587       0.018696526 0.03016153 0.4500544 0.12399087 0.09324875 0.01535353
## Gm37323       0.007725922 0.07567644 0.0000000 0.00183389 0.01159152 0.00000000
##                       SC2 Sertoli       Spg
## 4933401J01Rik 0.000000000       0 0.0000000
## Gm38385       0.003460472       0 0.0000000
## Rp1           0.322834514       0 0.0000000
## Sox17         0.415854932       0 0.0000000
## Gm37587       0.134952815       0 0.1553769
## Gm37323       0.002597593       0 0.0000000

# write it to a csv file
write.csv(mean_df, file=paste0(results_dir, "/mean_logcounts_by_Lukassen_CellTypes.csv"))

Session Info

sessionInfo()

## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS/LAPACK: FlexiBLAS OPENBLAS;  LAPACK version 3.11.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: US/Eastern
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] readxl_1.4.3                RColorBrewer_1.1-3         
##  [3] pheatmap_1.0.12             AUCell_1.24.0              
##  [5] scDblFinder_1.16.0          bluster_1.12.0             
##  [7] PCAtools_2.14.0             ggrepel_0.9.4              
##  [9] patchwork_1.1.3             Matrix_1.6-4               
## [11] DropletUtils_1.22.0         scran_1.30.2               
## [13] scater_1.30.1               scuttle_1.12.0             
## [15] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
## [17] Biobase_2.62.0              GenomicRanges_1.54.1       
## [19] GenomeInfoDb_1.38.8         IRanges_2.36.0             
## [21] S4Vectors_0.40.2            BiocGenerics_0.48.1        
## [23] MatrixGenerics_1.14.0       matrixStats_1.1.0          
## [25] lubridate_1.9.3             forcats_1.0.0              
## [27] stringr_1.5.0               dplyr_1.1.4                
## [29] purrr_1.0.2                 readr_2.1.4                
## [31] tidyr_1.3.0                 tibble_3.2.1               
## [33] ggplot2_3.4.4               tidyverse_2.0.0            
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.2             later_1.3.1              
##   [3] BiocIO_1.12.0             bitops_1.0-7             
##   [5] R.oo_1.25.0               cellranger_1.1.0         
##   [7] graph_1.80.0              XML_3.99-0.16            
##   [9] lifecycle_1.0.3           edgeR_4.0.16             
##  [11] lattice_0.22-5            MASS_7.3-60              
##  [13] magrittr_2.0.3            plotly_4.10.3            
##  [15] limma_3.58.1              sass_0.4.7               
##  [17] rmarkdown_2.25            jquerylib_0.1.4          
##  [19] yaml_2.3.7                metapod_1.10.1           
##  [21] httpuv_1.6.12             cowplot_1.1.1            
##  [23] DBI_1.1.3                 abind_1.4-5              
##  [25] zlibbioc_1.48.2           Rtsne_0.17               
##  [27] mixtools_2.0.0            R.utils_2.12.3           
##  [29] RCurl_1.98-1.13           GenomeInfoDbData_1.2.11  
##  [31] irlba_2.3.5.1             annotate_1.80.0          
##  [33] dqrng_0.4.0               DelayedMatrixStats_1.24.0
##  [35] codetools_0.2-19          DelayedArray_0.28.0      
##  [37] tidyselect_1.2.0          farver_2.1.1             
##  [39] ScaledMatrix_1.10.0       viridis_0.6.4            
##  [41] GenomicAlignments_1.38.2  jsonlite_1.8.7           
##  [43] BiocNeighbors_1.20.2      ellipsis_0.3.2           
##  [45] survival_3.5-7            systemfonts_1.0.5        
##  [47] segmented_2.0-0           tools_4.3.2              
##  [49] ragg_1.2.6                Rcpp_1.0.11              
##  [51] glue_1.6.2                gridExtra_2.3            
##  [53] SparseArray_1.2.4         xfun_0.41                
##  [55] HDF5Array_1.30.1          withr_2.5.2              
##  [57] fastmap_1.1.1             rhdf5filters_1.14.1      
##  [59] fansi_1.0.5               digest_0.6.33            
##  [61] rsvd_1.0.5                timechange_0.2.0         
##  [63] R6_2.5.1                  mime_0.12                
##  [65] textshaping_0.3.7         colorspace_2.1-0         
##  [67] RSQLite_2.3.4             R.methodsS3_1.8.2        
##  [69] utf8_1.2.4                generics_0.1.3           
##  [71] data.table_1.14.10        FNN_1.1.3.2              
##  [73] rtracklayer_1.62.0        htmlwidgets_1.6.2        
##  [75] httr_1.4.7                S4Arrays_1.2.1           
##  [77] uwot_0.2.2                pkgconfig_2.0.3          
##  [79] gtable_0.3.4              blob_1.2.4               
##  [81] XVector_0.42.0            htmltools_0.5.7          
##  [83] GSEABase_1.64.0           scales_1.3.0             
##  [85] png_0.1-8                 knitr_1.45               
##  [87] rstudioapi_0.15.0         tzdb_0.4.0               
##  [89] reshape2_1.4.4            rjson_0.2.21             
##  [91] nlme_3.1-164              cachem_1.0.8             
##  [93] rhdf5_2.46.1              parallel_4.3.2           
##  [95] vipor_0.4.5               AnnotationDbi_1.64.1     
##  [97] restfulr_0.0.15           pillar_1.9.0             
##  [99] grid_4.3.2                vctrs_0.6.4              
## [101] promises_1.2.1            BiocSingular_1.18.0      
## [103] beachmat_2.18.1           xtable_1.8-4             
## [105] cluster_2.1.6             beeswarm_0.4.0           
## [107] evaluate_0.23             cli_3.6.1                
## [109] locfit_1.5-9.8            compiler_4.3.2           
## [111] Rsamtools_2.18.0          rlang_1.1.2              
## [113] crayon_1.5.2              labeling_0.4.3           
## [115] plyr_1.8.9                ggbeeswarm_0.7.2         
## [117] stringi_1.7.12            viridisLite_0.4.2        
## [119] BiocParallel_1.36.0       munsell_0.5.0            
## [121] Biostrings_2.70.3         lazyeval_0.2.2           
## [123] hms_1.1.3                 sparseMatrixStats_1.14.0 
## [125] bit64_4.0.5               Rhdf5lib_1.24.2          
## [127] KEGGREST_1.42.0           statmod_1.5.0            
## [129] shiny_1.7.5.1             highr_0.10               
## [131] kernlab_0.9-32            igraph_1.5.1             
## [133] memoise_2.0.1             bslib_0.5.1              
## [135] bit_4.0.5                 xgboost_1.7.6.1

scRNA-seq analysis in R

Stephanie Hickey

2024-07-23

OSCA: Orchestrating Single-Cell Analysis with Bioconductor

Outline

Set up

Open the R environment on HPCC

Load libraries

Set paths

Read in the count data

Make a `SingleCellExperiment` object

Key Features:

Detect empty droplets with `DropletUtils`

Knee plot

Distinguish empty droplets from cells

Identifying low quality cells

Summarize the stats

Filtering with fixed thresholds

Filtering with adaptive thresholds

Normalization and transformation

Feature selection

Dimesnionality reduction: PCA for downstream analysis

Clustering

Adjust \(k\) to change the cluster resolution

Dimesnionality reduction: UMAP and TSNE for visualization

Doublet detection

Identify cluster enriched genes

Assigning cell labels from gene sets

Assign lables to single cells

Assign lables to clusters

Mean expression tables

Start with deposited counts and supplied metadata

Session Info

scRNA-seq analysis in R

Stephanie Hickey

2024-07-23

OSCA: Orchestrating Single-Cell Analysis with Bioconductor

Outline

Set up

Open the R environment on HPCC

Load libraries

Set paths

Read in the count data

Make a SingleCellExperiment object

Key Features:

Detect empty droplets with DropletUtils

Knee plot

Distinguish empty droplets from cells

Identifying low quality cells

Summarize the stats

Filtering with fixed thresholds

Filtering with adaptive thresholds

Normalization and transformation

Feature selection

Dimesnionality reduction: PCA for downstream analysis

Clustering

Adjust \(k\) to change the cluster resolution

Dimesnionality reduction: UMAP and TSNE for visualization

Doublet detection

Identify cluster enriched genes

Assigning cell labels from gene sets

Assign lables to single cells

Assign lables to clusters

Mean expression tables

Start with deposited counts and supplied metadata

Session Info

Make a `SingleCellExperiment` object

Detect empty droplets with `DropletUtils`