Running nf-core/chipseq on MSU HPCC

Overview

This guide walks you through running the nf-core/chipseq pipeline on the MSU HPCC for reproducible and efficient ChIP-seq data analysis.

Key Benefits of nf-core/chipseq

Reproducible ChIP-seq Analysis: A robust, community-maintained pipeline.
Portability: Runs smoothly on various computing infrastructures.
Scalability: Handles both small and large ChIP-seq datasets.

Prerequisites

Access to MSU HPCC with a valid ICER account.
Basic familiarity with the command line.

Note on Directory Variables

On the MSU HPCC:

$HOME refers to the user’s home directory (/mnt/home/username).
$SCRATCH refers to the user’s scratch directory, ideal for temporary files and large data processing.
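
Before relying on these variables in a script, it can help to confirm that they are set in your session and point at real directories. The following is a minimal sketch; check_dir_var is a hypothetical helper written for this guide, not a command provided on the HPCC.

```shell
# check_dir_var is a hypothetical helper: given a variable NAME, report
# its value and whether it points to an existing directory.
check_dir_var() (
    name=$1
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    elif [ ! -d "$value" ]; then
        echo "$name=$value (directory not found)"
        return 1
    fi
    echo "$name=$value"
)

check_dir_var HOME || true
check_dir_var SCRATCH || true
```

If SCRATCH is reported as not set, start a fresh login shell on the HPCC before continuing.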

Note on Working Directory

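The working directory is where Nextflow stores intermediate and temporary files. By default it is a folder named work in the directory you launch from; you can point it elsewhere, such as $SCRATCH, with the -w (or -work-dir) flag when running the pipeline. This keeps large temporary data separate from your results.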

Step-by-Step Tutorial

1. Create a Project Directory

Make a new folder for your ChIP-seq analysis:

mkdir -p "$HOME/chipseq"
cd "$HOME/chipseq"

These commands create the directory (if it does not already exist) and move you into it.

2. Prepare a Sample Sheet

You need to create a file called samplesheet.csv that lists your samples and their FASTQ file paths. Use a text editor (like nano) to create this file:

nano samplesheet.csv

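Then add your sample information in CSV format, one row per sample. The required columns depend on the pipeline release you run; the nf-core/chipseq usage documentation also describes control and control_replicate columns for pairing each IP sample with its input control, so check it against your release before running. For example: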

sample,fastq_1,fastq_2,replicate,antibody
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz,1,H3K27ac
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz,1,H3K27me3

Save the file (in nano, press Ctrl+O and then Enter to save, then Ctrl+X to exit).
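
A mistyped FASTQ path is a common reason for a run to fail early, so it can be worth checking the sample sheet before submitting anything. The sketch below is one way to do that. check_fastqs is a hypothetical helper, not part of nf-core/chipseq, and it assumes a plain CSV with fastq_1 and fastq_2 header columns and no quoted commas.

```shell
# check_fastqs is a hypothetical helper, not part of nf-core/chipseq.
# It lists FASTQ paths from the sample sheet that do not exist on disk.
# Assumes a plain CSV whose fields contain no quoted commas.
check_fastqs() (
    sheet=$1
    if [ ! -f "$sheet" ]; then
        echo "sample sheet not found: $sheet"
        return 1
    fi
    # Strip Windows line endings, find the fastq_1/fastq_2 columns by
    # header name, and test every non-empty path in those columns.
    missing=$(
        tr -d '\r' < "$sheet" | awk -F, '
            NR == 1 {
                for (i = 1; i <= NF; i++)
                    if ($i == "fastq_1" || $i == "fastq_2") col[i] = 1
                next
            }
            { for (i in col) if ($i != "") print $i }
        ' | while IFS= read -r path; do
            [ -f "$path" ] || echo "missing: $path"
        done
    )
    if [ -n "$missing" ]; then
        echo "$missing"
        return 1
    fi
    echo "all FASTQ files found"
)
```

After defining the function, run check_fastqs samplesheet.csv from your project directory. Any line starting with "missing:" points to a path that needs correcting.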

3. Create a Configuration File

Do not type file content directly into the terminal. Use a text editor instead. Create a file named icer.config:

nano icer.config

Paste the following content into the file:

process {
    executor = 'slurm'
}

Save and exit the editor.

4. Prepare the Job Submission Script

Now, create a shell script to run the pipeline. Create a file called run_chipseq.sh:

nano run_chipseq.sh

Paste in the following script:

#!/bin/bash --login
#SBATCH --job-name=chipseq
#SBATCH --time=24:00:00
#SBATCH --mem=4GB
#SBATCH --cpus-per-task=1
#SBATCH --output=chipseq-%j.out

# Load Nextflow
module purge
module load Nextflow

# Set the paths to the genome files
GENOME_DIR="/mnt/research/common-data/Bio/genomes/Ensembl_GRCm39_mm39" # Example GRCm39
FASTA="$GENOME_DIR/genome.fa" # Example FASTA
GTF="$GENOME_DIR/genes.gtf" # Example GTF

# Define the samplesheet, outdir, workdir, and config
SAMPLESHEET="$HOME/chipseq/samplesheet.csv" # Example path to sample sheet
OUTDIR="$HOME/chipseq/results" # Example path to results directory
WORKDIR="$SCRATCH/chipseq/work" # Example path to work directory
CONFIG="$HOME/chipseq/icer.config" # Example path to icer.config file

# Run the ChIP-seq analysis (variables are quoted so paths with spaces still work)
nextflow pull nf-core/chipseq
nextflow run nf-core/chipseq -r 2.1.0 -profile singularity -work-dir "$WORKDIR" -resume \
    --input "$SAMPLESHEET" \
    --outdir "$OUTDIR" \
    --fasta "$FASTA" \
    --gtf "$GTF" \
    -c "$CONFIG"

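Edit the genome, sample sheet, output, and work directory paths to match your own data, and adjust the time and memory requests if needed. Save and close the file.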
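
Because the paths in the script are examples, a quick check that each input file exists can save a failed submission. The following is a sketch only; require_files is a hypothetical helper, and the paths are the example values from the script, which you should replace with your own.

```shell
# require_files is a hypothetical helper: it reports which of the given
# paths are not regular files and returns non-zero if any are missing.
require_files() (
    status=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "not found: $f"
            status=1
        fi
    done
    return "$status"
)

# Example paths copied from the job script; adjust them to your own setup.
GENOME_DIR="/mnt/research/common-data/Bio/genomes/Ensembl_GRCm39_mm39"
require_files \
    "$GENOME_DIR/genome.fa" \
    "$GENOME_DIR/genes.gtf" \
    "$HOME/chipseq/samplesheet.csv" \
    "$HOME/chipseq/icer.config" || echo "fix the paths above before submitting"
```

You can also run bash -n run_chipseq.sh, which parses the script for syntax errors without executing it.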

5. Submit Your Job

Submit your job to SLURM by typing:

sbatch run_chipseq.sh

This sends your job to the scheduler on the HPCC.
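
On success, sbatch prints a line of the form "Submitted batch job <id>". Keeping that ID makes it easier to find the matching log file later. The sketch below shows one way to capture it; job_id_from_sbatch is a hypothetical helper name, and the submission only runs where sbatch and run_chipseq.sh are present.

```shell
# job_id_from_sbatch is a hypothetical helper: it reads sbatch's default
# output line, "Submitted batch job <id>", and prints only the numeric ID.
job_id_from_sbatch() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}

# Only attempt a submission where sbatch is available (i.e. on the cluster).
if command -v sbatch >/dev/null 2>&1 && [ -f run_chipseq.sh ]; then
    JOB_ID=$(sbatch run_chipseq.sh | job_id_from_sbatch)
    echo "Submitted job $JOB_ID; log will be chipseq-$JOB_ID.out"
fi
```

The log name follows from the --output=chipseq-%j.out line in the job script, where %j is replaced by the job ID.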

6. Monitor Your Job

Check the status of your job with:

squeue -u $USER

When the job finishes, your output files will be in $HOME/chipseq/results (the OUTDIR set in the job script), and the job log will be in chipseq-<jobid>.out in the directory you submitted from.
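
To spot problems quickly, you can search the job log for error and warning lines. This is a minimal sketch: scan_log is a hypothetical helper, and the pattern is a simple case-insensitive match that may also pick up harmless lines.

```shell
# scan_log is a hypothetical helper: print lines mentioning errors or
# warnings (case-insensitive) together with their line numbers.
scan_log() (
    log=$1
    if [ ! -f "$log" ]; then
        echo "log not found: $log"
        return 1
    fi
    grep -n -i -E 'error|warn' "$log" || echo "no errors or warnings found in $log"
)

# Scan every job log in the current directory, if any exist.
for log in chipseq-*.out; do
    if [ -f "$log" ]; then
        scan_log "$log"
    fi
done
```

Run it from the directory where you submitted the job. Nextflow also writes a .nextflow.log file there, which scan_log can read in the same way.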

Note on Reference Genomes

Common reference genomes can be found in the research common-data space on the HPCC. Refer to the README file in that directory for more details. Additionally, you can find guidance on downloading reference genomes from Ensembl in this GitHub repository.

Best Practices

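Check Logs: Regularly inspect the SLURM job log (chipseq-<jobid>.out) and the .nextflow.log file in your launch directory for warnings or errors.
Resource Allocation: Adjust icer.config so that resource requests match your dataset size.
Storage Management: Intermediate files in the work directory often take up more space than the final results, so keep the work directory on $SCRATCH and confirm you have enough space before large runs.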
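
As a rough way to keep an eye on storage, the sketch below reports the size of the results and work directories used in this guide. report_usage is a hypothetical helper, and the two paths are the example values from the job script.

```shell
# report_usage is a hypothetical helper: print the total size of each
# directory given, or note that it does not exist yet.
report_usage() (
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            du -sh "$dir"
        else
            echo "not found: $dir"
        fi
    done
)

# Example paths from the job script; adjust them if you changed OUTDIR or WORKDIR.
report_usage "$HOME/chipseq/results" "$SCRATCH/chipseq/work"
```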

Getting Help

If you encounter any issues or have questions while running nf-core/chipseq on the HPCC, consider the following resources:

nf-core Community: Visit the nf-core website for documentation and support.
ICER Support: Contact ICER via the MSU ICER support page.
Slack Channel: Join the nf-core Slack for real-time assistance.
Nextflow Documentation: See the Nextflow documentation for further details.

Conclusion

Running nf-core/chipseq on the MSU HPCC is streamlined with Singularity and Nextflow modules. This setup supports reproducible, efficient, and large-scale ChIP-seq analyses. By following this guide, you can take full advantage of the HPCC’s computing power for your bioinformatics projects.

November 04, 2024 John Vusich, Leah Terrian, Nicholas Panchy