In this section you will use:
prefetch and fasterq-dump from the SRA-toolkit to download fastq files from the SRA.
kb ref to download the pre-made mouse reference index.
kb-count to get cell by gene count data.
wget to download processed data from GEO.
We’ll use raw data from Lukassen et al., Single-cell RNA sequencing of adult mouse testes, which is deposited in the SRA (SRP119327). There is one entry for each mouse, with three different files per mouse.
To do this on the HPC, we’ll use the prefetch and
fasterq-dump functions from the SRA-toolkit.
# make a subfolder for this data in scratch
mkdir $SCRATCH/BCC105_scRNAseq_training/
mkdir $SCRATCH/BCC105_scRNAseq_training/raw_data/
mkdir $SCRATCH/BCC105_scRNAseq_training/raw_data/Lukassen_testes
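Plain mkdir stops with an error if a folder already exists or if a parent folder is missing. A minimal sketch of an alternative that is safe to re-run is shown here. RAW_DIR is a hypothetical variable name used only for illustration, and the fallback to a temporary directory is an assumption so that the snippet also runs where SCRATCH is not set.

```shell
# SCRATCH is normally set by the cluster; the temporary-directory fallback
# is an assumption so this sketch also runs outside the HPC.
SCRATCH="${SCRATCH:-$(mktemp -d)}"

# RAW_DIR is a hypothetical helper variable, not part of the course scripts
RAW_DIR="$SCRATCH/BCC105_scRNAseq_training/raw_data/Lukassen_testes"

# -p creates parent folders as needed and does not fail if they already exist
mkdir -p "$RAW_DIR"
echo "raw data folder: $RAW_DIR"
```

Running the same lines a second time leaves the existing folders untouched.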
Do not run this example. The following script is in
the slurm submission file prefetch_fasterq_dump.sh.
# load the SRA-toolkit module
module purge
module load SRA-Toolkit/3.0.10-gompi-2023a
# save the sra file to the appropriate data directory
prefetch SRR6129050 -O $SCRATCH/BCC105_scRNAseq_training/raw_data/Lukassen_testes
prefetch SRR6129051 -O $SCRATCH/BCC105_scRNAseq_training/raw_data/Lukassen_testes
# extract the fastq files from the sra file
cd $SCRATCH/BCC105_scRNAseq_training/raw_data/Lukassen_testes
fasterq-dump SRR6129050
fasterq-dump SRR6129051
Because the download can take a long time, it’s best to submit this code as a
job to slurm. Open prefetch_fasterq_dump.sh and modify
the code as needed (for example, the output paths) before submitting.
# set your working directory
WORK=/mnt/research/bioinformaticsCore/projects/petroffm/BCC105_scRNAseq_training
# make a `run` folder to hold the output files generated
# when you run your code
mkdir $WORK/run
cd $WORK/run
sbatch $WORK/src/prefetch_fasterq_dump.sh
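Once the job has finished, it can help to confirm that the fastq files needed later are actually there. The following is a sketch only: check_fastqs is a hypothetical helper, and it assumes the `<accession>_1.fastq` and `<accession>_2.fastq` naming that the kb count step in this section relies on. It does not look at any other files fasterq-dump may write.

```shell
# Report any expected fastq file that is missing or empty.
# Usage: check_fastqs <directory> <accession> [<accession> ...]
# check_fastqs is a hypothetical helper, not part of the SRA-toolkit.
check_fastqs() {
    dir=$1
    shift
    missing=0
    for acc in "$@"; do
        for mate in 1 2; do
            f="$dir/${acc}_${mate}.fastq"
            # -s is true only if the file exists and is not empty
            if [ ! -s "$f" ]; then
                echo "missing or empty: $f"
                missing=$((missing + 1))
            fi
        done
    done
    [ "$missing" -eq 0 ]
}

# Falls back to the current directory if SCRATCH is not set (assumption)
if check_fastqs "${SCRATCH:-.}/BCC105_scRNAseq_training/raw_data/Lukassen_testes" \
        SRR6129050 SRR6129051; then
    echo "all expected fastq files are present"
else
    echo "some fastq files are missing; check the slurm log in the run folder"
fi
```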
Kallisto | bustools (kb) is a suite of tools designed for the efficient and accurate quantification of single-cell RNA-seq data. It combines the speed of Kallisto, a pseudoalignment tool, with the power of Bustools, which handles the processing of barcodes and UMIs (Unique Molecular Identifiers).
kb ref to download the pre-made mouse reference index
The standard index is packaged with the following two files:
index.idx: The kallisto index that reads will be pseudoaligned against.
t2g.txt: A file containing the transcript-to-gene mapping.
Arguments:
-i: What you want to name index.idx.
-g: What you want to name t2g.txt.
-d: The name of the pre-built kallisto index you want to download.
Get more information about the pre-made indices here. You can also use this function to build your own index. You only need to do this once per genome.
# load the Conda module
module purge
module load Conda/3
# if you haven't already
cd $WORK/data
kb ref -i kallisto_mouse.idx -g kallisto_mouse_t2g.txt -d mouse
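Because the index only needs to be downloaded once per genome, a guard around kb ref can avoid repeating the download. This is a sketch under assumptions: index_ready, IDX, and T2G are hypothetical names introduced here, and the check only tests that both files exist and are non-empty, not that they are valid.

```shell
# Succeeds only if both index files exist and are non-empty.
# index_ready is a hypothetical helper, not part of kb.
index_ready() {
    [ -s "$1" ] && [ -s "$2" ]
}

IDX=kallisto_mouse.idx
T2G=kallisto_mouse_t2g.txt

if index_ready "$IDX" "$T2G"; then
    echo "index already present, skipping kb ref"
elif command -v kb >/dev/null 2>&1; then
    # same command as above, run only when the files are not there yet
    kb ref -i "$IDX" -g "$T2G" -d mouse
else
    echo "kb not found on PATH; load the Conda module first"
fi
```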
kb-count to get cell by gene count data
kb-count is a flexible pseudoaligner that works with many single-cell RNA-seq
technologies.
To see all of the parameters for running kb-count, click
the link above or run:
kb count -h
To list the technologies it supports run the following command:
kb --list
10X Chromium 3’ v2 chemistry was used to generate the testis data, so
we’ll use -x 10XV2 to specify this.
Make folders in $WORK/data to hold the
kb count output for each sample:
# paper dir
mkdir $WORK/data/Lukassen_testes
# mouse1
mkdir $WORK/data/Lukassen_testes/SRR6129050_mouse1
# mouse2
mkdir $WORK/data/Lukassen_testes/SRR6129051_mouse2
kb-count Arguments
-o: Output directory, the folder where you want the count data to go.
-t: Threads for running in parallel. Must be <= --cpus-per-task in the slurm submission file.
-i: Path to kallisto index/indices, comma-delimited.
-g: Path to transcript-to-gene mapping.
-x: Single-cell technology used (kb --list to view).
--filter: Used to filter out barcodes with low UMI counts.
fastq files are listed at the end.
Do not run this example. The following script is in
the slurm submission file mouse1_kb_count.sh.
# load the conda module
module purge
module load Conda/3
kb count \
-o $WORK/data/Lukassen_testes/SRR6129050_mouse1 \
-t 32 \
-i $WORK/data/kallisto_mouse.idx \
-g $WORK/data/kallisto_mouse_t2g.txt \
-x 10XV2 \
--filter \
$SCRATCH/BCC105_scRNAseq_training/raw_data/Lukassen_testes/SRR6129050_1.fastq \
$SCRATCH/BCC105_scRNAseq_training/raw_data/Lukassen_testes/SRR6129050_2.fastq
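The value passed to -t must not exceed --cpus-per-task in the slurm submission file. One way to keep the two in sync, sketched here, is to read the SLURM_CPUS_PER_TASK environment variable that Slurm sets inside a job when --cpus-per-task is requested. THREADS is a hypothetical variable name, and the fallback of 1 is an assumption for runs outside Slurm.

```shell
# Slurm sets SLURM_CPUS_PER_TASK inside a job when --cpus-per-task is requested.
# The fallback of 1 is an assumption for runs outside Slurm.
THREADS="${SLURM_CPUS_PER_TASK:-1}"

# In the submission file, "-t 32" could then be written as: -t "$THREADS"
echo "kb count would be run with: -t $THREADS"
```

With this pattern, changing --cpus-per-task in the submission file header is enough; the thread count follows automatically.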
Edit mouse1_kb_count.sh to replace the file paths with
your own paths. Then submit mouse1_kb_count.sh:
cd $WORK/run
sbatch $WORK/src/mouse1_kb_count.sh
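After the job finishes, a quick look at the number of barcodes before and after filtering shows what --filter did. The sketch below assumes the output layout written by recent kb-python versions, with counts_unfiltered and counts_filtered subfolders that each contain cells_x_genes.barcodes.txt; check your own output folder, since the layout may differ between versions. count_barcodes and OUT are hypothetical names.

```shell
# Print the number of barcodes in one kb count result folder.
# Usage: count_barcodes <kb_output_dir> <counts_unfiltered|counts_filtered>
# count_barcodes is a hypothetical helper; the file layout is an assumption.
count_barcodes() {
    f="$1/$2/cells_x_genes.barcodes.txt"
    if [ -f "$f" ]; then
        # the barcodes file holds one barcode per line
        wc -l < "$f" | tr -d ' '
    else
        echo "not found: $f" >&2
        return 1
    fi
}

# Falls back to the current directory if WORK is not set (assumption)
OUT="${WORK:-.}/data/Lukassen_testes/SRR6129050_mouse1"
for sub in counts_unfiltered counts_filtered; do
    if n=$(count_barcodes "$OUT" "$sub"); then
        echo "$sub: $n barcodes"
    fi
done
```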
Make a copy of mouse1_kb_count.sh named mouse2_kb_count.sh, and edit it to
process mouse2 (SRR6129051, with output in SRR6129051_mouse2).
Submit mouse2_kb_count.sh:
cd $WORK/run
sbatch $WORK/src/mouse2_kb_count.sh
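The copy-and-edit step for mouse2 can also be done with sed instead of by hand. This is a sketch, not the course's method: make_mouse2_script is a hypothetical helper, and it assumes the only differences between the two scripts are the accession (SRR6129050 vs SRR6129051) and the sample label (mouse1 vs mouse2). The stand-in input file below contains just two lines copied from the example above so that the snippet runs on its own.

```shell
# Derive the mouse2 script from the mouse1 script by swapping the
# accession and the sample label.
# Usage: make_mouse2_script <mouse1_script> <mouse2_script>
make_mouse2_script() {
    sed -e 's/SRR6129050/SRR6129051/g' -e 's/mouse1/mouse2/g' "$1" > "$2"
}

# Stand-in for mouse1_kb_count.sh (two lines from the example above)
tmp=$(mktemp -d)
cat > "$tmp/mouse1_kb_count.sh" <<'EOF'
-o $WORK/data/Lukassen_testes/SRR6129050_mouse1 \
$SCRATCH/BCC105_scRNAseq_training/raw_data/Lukassen_testes/SRR6129050_1.fastq \
EOF

make_mouse2_script "$tmp/mouse1_kb_count.sh" "$tmp/mouse2_kb_count.sh"
cat "$tmp/mouse2_kb_count.sh"
```

On the real files this would be `make_mouse2_script $WORK/src/mouse1_kb_count.sh $WORK/src/mouse2_kb_count.sh`. Read through the result before submitting, since every occurrence of mouse1 in the file is replaced, including any job name or log file name.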
If the processed counts from the paper are already available, use those instead. For Lukassen et al., the count matrix, genes, and cell barcodes are available from GEO under accession GSE104556.
cd $WORK/data/Lukassen_testes
# this will download all of the .gz files under `Supplementary File`
# this template works for downloading any suppl file
# from GEO
# the URL is quoted so the shell passes the * to wget unchanged
wget 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE104nnn/GSE104556/suppl/*.gz'
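To reuse this template for another series, the folder name in the URL has to be derived from the accession: GSE104556 sits in GSE104nnn, that is, the accession with its last three digits replaced by nnn. The sketch below builds the URL from an accession. geo_suppl_url is a hypothetical helper, and the handling of accessions with three or fewer digits (folder GSEnnn) is an assumption about the GEO FTP layout that you should verify for such series.

```shell
# Build the GEO supplementary-file folder URL for a series accession.
# Usage: geo_suppl_url GSE104556
# geo_suppl_url is a hypothetical helper, not an NCBI tool.
geo_suppl_url() {
    acc=$1
    num=${acc#GSE}                  # digits only, e.g. 104556
    if [ "${#num}" -le 3 ]; then
        stub="GSEnnn"               # assumption for short accessions
    else
        stub="GSE${num%???}nnn"     # drop the last three digits, e.g. GSE104nnn
    fi
    printf 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/suppl/\n' "$stub" "$acc"
}

geo_suppl_url GSE104556
```

The printed folder URL can then be combined with the file pattern, for example `wget "$(geo_suppl_url GSE104556)*.gz"`.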
Part 2: Analyze the count data in R