Overview
This guide details how to run the nf-core/sarek pipeline for whole genome sequencing (WGS) analysis, ensuring efficient and reproducible workflows.
Key Benefits of nf-core/sarek
- Reproducible WGS Analysis: Supports germline and somatic variant calling.
- Portability: Runs seamlessly across various computing infrastructures.
- Scalability: Handles both small-scale and large-scale WGS datasets.
Prerequisites
- Access to MSU HPCC with a valid ICER account.
- Basic familiarity with the command line.
Note on Directory Variables
On the MSU HPCC:
- $HOME refers to the user's home directory (/mnt/home/username).
- $SCRATCH refers to the user's scratch directory, ideal for temporary files and large data processing.
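You can confirm where these point for your account by printing them in a terminal:
echo $HOME      # e.g. /mnt/home/username
echo $SCRATCH   # your scratch space for large, temporary files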
Note on Working Directory
The working directory, where intermediate and temporary files are stored, can be specified with the -w flag when running the pipeline. This keeps outputs and temporary data organized.
Step-by-Step Tutorial
1. Create a Project Directory
Make a new folder for your WGS analysis:
mkdir $HOME/sarek
cd $HOME/sarek
This command creates the directory and moves you into it.
2. Prepare Sample Sheet
You need to create a file called samplesheet.csv that lists your samples and their FASTQ file paths. Create and edit the file in OnDemand, or use a text editor (like nano):
nano samplesheet.csv
Then, add your sample information in CSV format. For example:
patient,sex,status,sample,lane,fastq_1,fastq_2
patient1,XY,1,sample1,lane_1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
patient2,XX,0,sample2,lane_1,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
Here sex is given as XX or XY, and status marks normal (0) versus tumor (1) samples; see the nf-core/sarek usage documentation for the full input format.
Save the file (in nano, press Ctrl+O to write the file, then Ctrl+X to exit).
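Before moving on, it can help to confirm that every FASTQ path in the sheet exists. A minimal check, assuming the FASTQ paths sit in the last two columns as in the example above:
awk -F',' 'NR > 1 {print $(NF-1); print $NF}' samplesheet.csv | while read -r f; do
  [ -f "$f" ] || echo "Missing: $f"
done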
3. Create a Configuration File
Do not type file content directly into the terminal. Use a text editor instead. Create a file named icer.config:
nano icer.config
Paste the following content into the file:
process {
executor = 'slurm'
}
Save and exit the editor.
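The minimal config above only tells Nextflow to submit each pipeline process as its own SLURM job. If a particular step needs more (or less) than the pipeline's defaults, the same file can carry per-process overrides. A sketch using a standard Nextflow withName selector; the process name BWAMEM1_MEM is only an example, so match it to the process names shown in your run's output:
process {
    executor = 'slurm'

    // Example override; adjust the selector and resources to your dataset
    withName: 'BWAMEM1_MEM' {
        cpus   = 8
        memory = 32.GB
        time   = 12.h
    }
}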
4. Prepare the Job Submission Script
Now, create a shell script to run the pipeline. Create a file called run_sarek.sh:
nano run_sarek.sh
Paste in the following script:
#!/bin/bash --login
#SBATCH --job-name=sarek
#SBATCH --time=24:00:00
#SBATCH --mem=4GB
#SBATCH --cpus-per-task=1
#SBATCH --output=sarek-%j.out
# Load Nextflow
module purge
module load Nextflow
# Set the paths to the genome files
GENOME_DIR="/mnt/research/common-data/Bio/genomes/Ensembl_GRCm39_mm39" # Example: GRCm39
FASTA="$GENOME_DIR/genome.fa" # Example FASTA
# Define the samplesheet, outdir, workdir, and config
SAMPLESHEET="$HOME/sarek/samplesheet.csv" # Example path to sample sheet
OUTDIR="$HOME/sarek/results" # Example path to results directory
WORKDIR="$SCRATCH/sarek/work" # Example path to work directory
CONFIG="$HOME/sarek/icer.config" # Example path to icer.config file
# Run the WGS analysis
nextflow pull nf-core/sarek
nextflow run nf-core/sarek -r 3.5.1 -profile singularity -work-dir $WORKDIR -resume \
--input $SAMPLESHEET \
--outdir $OUTDIR \
--fasta $FASTA \
--tools cnvkit,strelka \
-c $CONFIG
Make edits as needed. Save and close the file.
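Optionally, before submitting, you can check from a login node that the pinned revision is available and review the pipeline's parameters; these are standard Nextflow commands:
module load Nextflow
nextflow pull nf-core/sarek -r 3.5.1        # fetch or update the pinned revision
nextflow info nf-core/sarek                 # list the revisions available locally
nextflow run nf-core/sarek -r 3.5.1 --help  # print the pipeline's parameters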
5. Submit Your Job
Submit your job to SLURM by typing:
sbatch run_sarek.sh
This sends your job to the scheduler on the HPCC.
6. Monitor Your Job
Check the status of your job with:
squeue -u $USER
After completion, your output files will be in the results folder inside your sarek directory.
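In addition to squeue, you can follow the SLURM log written by the #SBATCH --output line and, after the job ends, summarize resource use with sacct. Replace <jobid> with the job ID that sbatch printed:
tail -f sarek-<jobid>.out                                      # live progress from the Nextflow head job
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS   # summary after completion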
Note on Sarek Starting Step and Tools
The nf-core/sarek pipeline lets you choose among several variant-calling tools and start the analysis at different steps. By default, the pipeline begins at the pre-processing/mapping step; the variant callers to run are selected with the --tools parameter (Strelka and CNVkit in the script above). Not every tool supports every analysis type (germline, tumor-only, or tumor/normal), so check tool compatibility before choosing. To see the available tools and their input format, go to the nf-core/sarek parameters page and click the Help text option under --tools. To see the available starting steps and their input format, open the drop-down starting with mapping (default) next to the --step parameter on the same page. If you run into problems, check the nf-core/sarek troubleshooting documentation.
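As a sketch of restarting at a later step: sarek writes samplesheets describing its intermediate outputs under the csv folder of your results directory, and one of those can be fed back in with --step. The command below reuses the variables defined in run_sarek.sh above; the file name recalibrated.csv is an example, so check your results directory for the exact name:
nextflow run nf-core/sarek -r 3.5.1 -profile singularity -work-dir $WORKDIR -resume \
    --input $OUTDIR/csv/recalibrated.csv \
    --step variant_calling \
    --outdir $OUTDIR \
    --fasta $FASTA \
    --tools cnvkit,strelka \
    -c $CONFIG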
Note on Reference Genomes
Common reference genomes are located in the research common-data space on the HPCC. Refer to the README file for details. For more guidance on downloading reference genomes from Ensembl, see this GitHub repository.
Best Practices
- Review Logs: Regularly check log files for warnings or errors (example commands below).
- Optimize Resource Usage: Adjust icer.config to match your dataset requirements.
- Manage Storage: Ensure ample storage for intermediate and final data.
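For example, Nextflow writes a hidden .nextflow.log in the directory you launched the job from (here $HOME/sarek), and the work directory on scratch can grow large; both are worth checking after each run:
grep -iE 'error|warn' $HOME/sarek/.nextflow.log   # scan the Nextflow log for problems
du -sh $SCRATCH/sarek/work                        # check the size of intermediate files
rm -rf $SCRATCH/sarek/work                        # remove intermediates once the run has finished successfully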
Getting Help
If you encounter issues running nf-core/sarek on the HPCC, consult the following resources:
- nf-core Community: Visit the nf-core website for documentation and support.
- ICER Support: Contact ICER via the MSU ICER support page.
- Slack Channel: Join the nf-core Slack for real-time help.
- Nextflow Documentation: See the Nextflow documentation for additional details.
Conclusion
Running nf-core/sarek on the MSU HPCC is streamlined by Nextflow and Singularity. Following this guide gives you a reproducible, scalable WGS workflow that makes full use of the HPCC's computational resources for bioinformatics research.