Overview

The MSU HPCC, managed by ICER, offers a powerful environment for running bioinformatics analyses. This guide will walk you through running the nf-core/chipseq pipeline on the HPCC for reproducible and efficient ChIP-seq data analysis.

Key Benefits of nf-core/chipseq

nf-core/chipseq provides:

Prerequisites

Step-by-Step Tutorial

Note on Directory Variables

On the MSU HPCC:

Note on Working Directory

The working directory, which stores intermediate and temporary files, can be specified separately using the -w flag (long form: -work-dir) when running the pipeline. Keeping it apart from the results directory keeps your analysis outputs and temporary data organized.
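As a minimal sketch of choosing that location, the helper below builds a working-directory path under scratch space. The function name resolve_work_dir is hypothetical, and the sketch assumes $SCRATCH points at your scratch space, falling back to the current directory when it is unset.

```shell
#!/bin/bash
# Hypothetical helper: build a Nextflow working-directory path for a project.
# Assumes $SCRATCH is your scratch space; uses the current directory if it is unset.
resolve_work_dir() {
    local project="$1"
    local base="${SCRATCH:-$PWD}"
    echo "$base/${project}_work"
}
```

The result could then be passed to the pipeline, for example with -w "$(resolve_work_dir chipseq)".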

1. Load Nextflow Module

Ensure Nextflow is available in your environment:

module load Nextflow

2. Create an Analysis Directory

Set up a dedicated directory for your analysis (referred to as the Analysis Directory):

mkdir -p "$HOME/chipseq_project"
cd "$HOME/chipseq_project"

3. Prepare Sample Sheet

Create a sample sheet (samplesheet.csv) with the following format:

sample,fastq_1,fastq_2,replicate,antibody
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz,1,H3K27ac
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz,1,H3K27me3

Ensure all paths to the FASTQ files are correct.
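One way to check the paths before launching is a small shell function like the sketch below. The name check_samplesheet is hypothetical, and the sketch assumes the FASTQ paths sit in columns 2 and 3 and that no field contains a quoted comma.

```shell
#!/bin/bash
# Hypothetical helper: report FASTQ paths in a sample sheet that do not exist.
# Assumes columns 2 and 3 hold fastq_1 and fastq_2, with no quoted commas.
check_samplesheet() {
    local sheet="$1" missing=0 sample fq1 fq2 rest fq
    # Skip the header row; the "|| [ -n ... ]" keeps a last line without a newline.
    while IFS=, read -r sample fq1 fq2 rest || [ -n "$sample" ]; do
        for fq in "$fq1" "$fq2"; do
            # An empty field (e.g. no fastq_2) is skipped, not flagged.
            [ -z "$fq" ] && continue
            if [ ! -f "$fq" ]; then
                echo "MISSING ($sample): $fq"
                missing=$((missing + 1))
            fi
        done
    done < <(tail -n +2 "$sheet")
    # Succeed only when every listed file was found.
    [ "$missing" -eq 0 ]
}
```

Running check_samplesheet samplesheet.csv prints one line per missing file and returns a non-zero status if any are missing.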

4. Configure ICER Environment

Create an icer.config file to run the pipeline with SLURM:

process {
    executor = 'slurm'
}
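If you prefer to create the file from the command line rather than in an editor, a heredoc is one option. The sketch below wraps it in a hypothetical function, write_icer_config, which writes exactly the configuration shown above into a directory of your choice.

```shell
#!/bin/bash
# Hypothetical helper: write the minimal icer.config shown above
# into the given directory (default: current directory).
write_icer_config() {
    local dir="${1:-.}"
    # The quoted 'EOF' stops the shell from expanding anything in the body.
    cat > "$dir/icer.config" <<'EOF'
process {
    executor = 'slurm'
}
EOF
}
```

For example, write_icer_config "$HOME/chipseq_project" would place the file in the Analysis Directory.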

5. Run nf-core/chipseq

Example SLURM Job Submission Script

Below is a typical shell script for submitting an nf-core/chipseq job to SLURM:

#!/bin/bash

#SBATCH --job-name=chipseq_job
#SBATCH --time=48:00:00
#SBATCH --mem=32GB
#SBATCH --cpus-per-task=12

# Run from the Analysis Directory; stop if it does not exist
cd "$HOME/chipseq_project" || exit 1
module load Nextflow/24.04.2

# Fetch the pipeline, then run the pinned release with the SLURM config from step 4
nextflow pull nf-core/chipseq
nextflow run nf-core/chipseq -r 2.1.0 \
    -profile singularity \
    -c ./icer.config \
    -work-dir "$SCRATCH/chipseq_work" \
    --input ./samplesheet.csv \
    --outdir ./chipseq_results \
    --fasta ./Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
    --gtf ./Homo_sapiens.GRCh38.108.gtf.gz
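Submitting the script might look like the sketch below. The file name run_chipseq.sb is an assumption (save the script above under any name you like), and job_id_from_sbatch is a hypothetical helper that relies on sbatch's usual "Submitted batch job <id>" confirmation line.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from sbatch's confirmation line,
# which normally reads "Submitted batch job <id>".
job_id_from_sbatch() {
    local line="$1"
    # Drop everything up to and including the last space.
    echo "${line##* }"
}

# Submit only where sbatch is available and the script file exists.
# run_chipseq.sb is an assumed file name for the script above.
if command -v sbatch >/dev/null 2>&1 && [ -f run_chipseq.sb ]; then
    job_id=$(job_id_from_sbatch "$(sbatch run_chipseq.sb)")
    squeue -j "$job_id"    # show the job's current queue status
fi
```

Capturing the job ID at submission time makes it easy to query or cancel that specific job later.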

Note on Reference Genomes

Common reference genomes can be found in the research common-data space on the HPCC. Refer to the README file in that directory for more details. Additionally, you can find guidance on downloading reference genomes from Ensembl in this GitHub repository.

Execute the pipeline with the following command. This example uses the -w flag to place intermediate files in a working directory in the user’s scratch space, and --outdir to set where results are written:

nextflow run nf-core/chipseq -profile singularity --input samplesheet.csv --outdir ./chipseq_results --genome GRCh38 -c icer.config -w $SCRATCH/chipseq_project

6. Monitor and Manage the Run

Best Practices

Getting Help

If you encounter any issues or have questions while running nf-core/chipseq on the HPCC, consider the following resources:

Conclusion

Running nf-core/chipseq on the MSU HPCC is streamlined with Singularity and Nextflow modules. This setup supports reproducible, efficient, and large-scale ChIP-seq analyses. By following this guide, you can take full advantage of the HPCC’s computing power for your bioinformatics projects.

November 04, 2024   John Vusich, Leah Terrian