Overview

This guide will show you, step by step, how to analyze bulk RNA-seq data using the nf-core/rnaseq pipeline (for QC, alignment, and quantification) and the nf-core/differential-abundance pipeline for differential expression analysis and GSEA. Users can click here to skip directly to the differential abundance and GSEA steps if they already have a counts table.

Key Benefits

Prerequisites

Part 1: Pre-processing with nf-core/rnaseq

1. Create Your Project Directory

Make a new folder for your RNA-seq analysis:

mkdir $HOME/rnaseq_project
cd $HOME/rnaseq_project

This command creates the directory and moves you into it.

2. Prepare Your Sample Sheet

You need to create a file called samplesheet.csv that lists your samples and their FASTQ file paths. Use a text editor (like nano) to create this file:

nano samplesheet.csv

Then, add your sample information in CSV format. For example:

sample,fastq_1,fastq_2,strandness
CONTROL_REP1,/path/to/CONTROL_REP1_R1.fastq.gz,/path/to/CONTROL_REP1_R2.fastq.gz,auto
CONTROL_REP2,/path/to/CONTROL_REP2_R1.fastq.gz,/path/to/CONTROL_REP2_R2.fastq.gz,auto
TREATMENT_REP1,/path/to/TREATMENT_REP1_R1.fastq.gz,/path/to/TREATMENT_REP1_R2.fastq.gz,auto
TREATMENT_REP2,/path/to/TREATMENT_REP2_R1.fastq.gz,/path/to/TREATMENT_REP2_R2.fastq.gz,auto

Save the file (in nano, press Ctrl+O then Ctrl+X to exit).

3. Create the Configuration File

Do not type file content directly into the terminal—use a text editor instead. Create a file named icer.config:

nano icer.config

Paste the following content into the file:

process {
    executor = 'slurm'
}

Save and exit the editor.

4. Prepare the Job Submission Script

Now, create a shell script to run the pipeline. Create a file called run_rnaseq.sh:

nano run_rnaseq.sh

Paste in the following script:

#!/bin/bash --login

#SBATCH --job-name=rnaseq_job
#SBATCH --time=24:00:00
#SBATCH --mem=8GB
#SBATCH --cpus-per-task=4

cd $HOME/rnaseq_project
module purge
module load Nextflow/24.10.2

# Pull the latest nf-core/rnaseq pipeline
nextflow pull nf-core/rnaseq

# Run the pipeline
nextflow run nf-core/rnaseq -r 3.18.0 -profile singularity --input ./samplesheet.csv --outdir ./rnaseq_results --genome GRCh38 -work-dir $SCRATCH/rnaseq_work -c ./icer.config

Save and close the file.

5. Submit Your Job

Submit your job to SLURM by typing:

sbatch run_rnaseq.sh

This sends your job to the scheduler on the HPCC.

6. Monitor Your Job

Check the status of your job with:

squeue -u $USER

After completion, your output files will be in the rnaseq_results folder inside your project directory.

Part 2: Optional – Differential Expression Analysis

If you already have a counts table and want to perform differential expression and GSEA using nf-core/differential-abundance, follow these additional steps.

1. Create a New Project Directory

Create a separate folder for the differential expression analysis:

mkdir $HOME/diff_exp_project
cd $HOME/diff_exp_project

2. Prepare the Samplesheet and Input Files

Create a samplesheet.csv for differential analysis:

nano samplesheet.csv

Example content:

sample,condition,replicate,batch
CONTROL_REP1,control,1,A
CONTROL_REP2,control,2,B
TREATMENT_REP1,treated,1,A
TREATMENT_REP2,treated,2,B

Also, ensure you have:

Counts and transcript length files

The nf-core RNAseq workflow (described above) creates a raw counts and transcript length matrix file that can be used directly in the nf-core Differential Abundance workflow. Provide the paths to those files in the submission script in step 3 (below):

--matrix $HOME/rnaseq_project/rnaseq_results/star_salmon/salmon.merged.gene_counts.tsv \
--transcript_length_matrix $HOME/rnaseq_project/rnaseq_results/star_salmon/salmon.merged.gene_lengths.tsv

See this documentation for more information on RNAseq gene counts for nf-core/differentialabundance.

Contrasts file

The contrasts file defines the groups of samples to compare.

--contrasts '[path to contrasts file]'

Example contrasts.csv file:

id,variable,reference,target,blocking
condition_control_treated,condition,control,treated,
condition_control_treated_blockrep,condition,control,treated,replicate;batch

See this documentation for more information on the contrasts file for nf-core/differentialabundance.

3. Create the Job Submission Script for Differential Expression

Create a file called run_diff_exp.sh:

nano run_diff_exp.sh

Paste in the following script:

#!/bin/bash --login

#SBATCH --job-name=diff_exp_job
#SBATCH --time=3:59:00
#SBATCH --mem=32GB
#SBATCH --cpus-per-task=8

cd $HOME/diff_exp_project
module load Nextflow/24.10.2

# Pull the latest nf-core/differential-abundance pipeline
nextflow pull nf-core/differential-abundance

# Run the pipeline
nextflow run nf-core/differential-abundance -r 1.5.0 -profile singularity --input samplesheet.csv --matrix $HOME/rnaseq_project/rnaseq_results/star_salmon/salmon.merged.gene_counts.tsv  $HOME/rnaseq_project/rnaseq_results/star_salmon/salmon.merged.gene_lengths.tsv --outdir ./diff_exp_results -c ./icer.config

Save and close the file.

4. Submit the Differential Expression Job

Submit the job with:

sbatch run_diff_exp.sh

5. Monitor Your Job

Check job status with:

squeue -u $USER

Once finished, your differential expression results will be in the diff_exp_results folder.

Quick Reminders


Conclusion

Using nf-core/rnaseq and nf-core/differentialabundance on the MSU HPCC simplifies the process of bulk RNAseq analysis, including QC, alignment, quantification, and downstream analysis. The combination of Singularity and Nextflow ensures a reproducible and efficient workflow tailored for high-performance computing environments.


Getting Help

November 04, 2024   John Vusich, Leah Terrian, Nicholas Panchy