Overview

The MSU High Performance Computing Center (HPCC), managed by the Institute for Cyber-Enabled Research (ICER), provides a SLURM-managed cluster for running bioinformatics pipelines. This guide describes how to run the nf-core/sarek pipeline for whole genome sequencing (WGS) analysis on the HPCC using Nextflow and Singularity.

Key Benefits of nf-core/sarek

nf-core/sarek offers:

Prerequisites

Step-by-Step Tutorial

Note on Directory Variables

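On the MSU HPCC, the commands in this guide rely on two environment variables: $HOME (your home directory) and $SCRATCH (your scratch space, used here for the Nextflow working directory).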

Note on Working Directory

The working directory, where Nextflow stores intermediate and temporary files, can be set with the -w flag (short for -work-dir) when running the pipeline. Pointing it at scratch space keeps temporary data separate from the final results.
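
As a minimal sketch of how the working directory might be chosen, the snippet below builds a path under $SCRATCH and falls back to a directory under $HOME when $SCRATCH is unset. The helper name work_dir_for and the _work suffix are illustrative assumptions, not part of Nextflow or the HPCC setup.

```shell
#!/bin/bash
# work_dir_for RUN_NAME
# Hypothetical helper: prints a work directory path for a run.
# Assumes $SCRATCH is set on the cluster; otherwise falls back to $HOME/scratch.
work_dir_for() {
    local run_name="$1"
    local base="${SCRATCH:-$HOME/scratch}"
    printf '%s/%s_work\n' "$base" "$run_name"
}

WORK_DIR="$(work_dir_for sarek)"
echo "Nextflow work directory: $WORK_DIR"
```

The resulting path can then be created with mkdir -p "$WORK_DIR" and passed to nextflow run via -w "$WORK_DIR".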

1. Load Nextflow Module

Ensure Nextflow is loaded:

module load Nextflow

2. Create an Analysis Directory

Set up a directory for your analysis (referred to as the Analysis Directory):

mkdir -p "$HOME/sarek_project"
cd "$HOME/sarek_project"

3. Prepare Sample Sheet

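Create a sample sheet (samplesheet.csv) with the columns nf-core/sarek expects for FASTQ input. The sex column takes XX or XY, and status is 0 for a normal sample and 1 for a tumor sample:

patient,sex,status,sample,lane,fastq_1,fastq_2
patient1,XY,1,sample1,lane_1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
patient2,XX,0,sample2,lane_1,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz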

Ensure all paths to the FASTQ files are accurate.
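
One way to check the paths before launching is a small shell function like the sketch below. It looks up the fastq_1 and fastq_2 columns by name in the header row and reports any listed file that is missing or unreadable. The function name check_samplesheet is hypothetical, and the sketch assumes a header row and no quoted or comma-containing fields.

```shell
#!/bin/bash
# check_samplesheet FILE
# Hypothetical helper: succeeds only if every path in the fastq_1 and
# fastq_2 columns is a readable regular file.
check_samplesheet() {
    local sheet="$1" missing=0 path paths
    paths=$(awk -F',' '
        { sub(/\r$/, "") }          # tolerate Windows line endings
        NR == 1 {                   # locate the FASTQ columns by header name
            for (i = 1; i <= NF; i++)
                if ($i == "fastq_1" || $i == "fastq_2") cols[i] = 1
            next
        }
        { for (i in cols) if ($i != "") print $i }
    ' "$sheet") || return 2
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        if [ ! -f "$path" ] || [ ! -r "$path" ]; then
            echo "missing or unreadable: $path" >&2
            missing=$((missing + 1))
        fi
    done <<EOF
$paths
EOF
    [ "$missing" -eq 0 ]
}

# Usage: check_samplesheet samplesheet.csv && echo "all FASTQ files found"
```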

4. Configure ICER Environment

Create an icer.config file for SLURM:

process {
    executor = 'slurm'
}

5. Run nf-core/sarek

Example SLURM Job Submission Script

Below is a shell script for submitting an nf-core/sarek job to SLURM:

#!/bin/bash

#SBATCH --job-name=sarek_job
#SBATCH --time=72:00:00
#SBATCH --mem=64GB
#SBATCH --cpus-per-task=16

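cd "$HOME/sarek_project"
module load Nextflow/23.10.0

nextflow pull nf-core/sarek
# Pin a released sarek version with -r (check the nf-core/sarek releases page for the current one)
# and pass the icer.config file created in step 4.
nextflow run nf-core/sarek -r 3.4.4 \
    -profile singularity \
    --input ./samplesheet.csv \
    --outdir ./sarek_results \
    --genome GRCh38 \
    -work-dir "$SCRATCH/sarek_work" \
    -c ./icer.config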
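
The script above still has to be handed to SLURM. A minimal sketch of one way to submit it and keep the job ID for monitoring follows. The wrapper name submit_job and the file name run_sarek.sh are assumptions made for illustration; sbatch, its --parsable option, and squeue are standard SLURM commands.

```shell
#!/bin/bash
# submit_job SCRIPT
# Hypothetical wrapper around sbatch: prints the SLURM job ID on success.
submit_job() {
    local script="$1"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on the cluster" >&2
        return 1
    fi
    # --parsable makes sbatch print only the job ID (optionally followed by ;cluster)
    sbatch --parsable "$script"
}

# Usage (run_sarek.sh is an assumed file name for the script above):
#   job_id=$(submit_job run_sarek.sh) && squeue -j "${job_id%%;*}"
#   tail -f .nextflow.log    # Nextflow's own log, written in the launch directory
```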

Note on Reference Genomes

Common reference genomes are located in the research common-data space on the HPCC. Refer to the README file for details. For more guidance on downloading reference genomes from Ensembl, see this GitHub repository.

Execute the pipeline with the following command, including the -w flag for a separate working directory and --outdir for the results directory:

nextflow run nf-core/sarek -profile singularity --input samplesheet.csv --outdir ./sarek_results --genome GRCh38 -c icer.config -w $SCRATCH/sarek_project

6. Monitor and Manage the Run

Best Practices

Getting Help

If you encounter issues running nf-core/sarek on the HPCC, consult the following resources:

Conclusion

Nextflow and Singularity make it straightforward to run nf-core/sarek on the MSU HPCC. Running the pipeline in containers, submitting tasks through SLURM, and keeping the working directory on scratch space help keep WGS analyses reproducible and scalable on the HPCC’s resources.

November 04, 2024   John Vusich