Overview

The MSU HPCC, managed by ICER, provides an efficient and scalable environment for running complex bioinformatics analyses. This tutorial walks you through running the nf-core/atacseq pipeline on the HPCC, with an emphasis on reproducibility and efficient use of cluster resources.

Key Benefits of nf-core/atacseq

nf-core/atacseq is designed for reproducible, portable ATAC-seq analysis: it bundles quality control, adapter trimming, alignment, peak calling, and reporting into a single Nextflow workflow, runs inside containers (such as Singularity) so results are consistent across systems, and follows nf-core community standards for testing and documentation.

Prerequisites

Before you begin, you will need an MSU HPCC account, a terminal session on the HPCC, your ATAC-seq FASTQ files, and a reference genome (FASTA and GTF files, or an iGenomes key such as GRCh38).

Step-by-Step Tutorial

Note on Directory Variables

On the MSU HPCC, $HOME refers to your personal home directory and $SCRATCH refers to your scratch space, which is intended for large, temporary files. Both variables are used in the commands below.

Note on Working Directory

The working directory, which stores intermediate and temporary files, can be specified separately using the -w (equivalently, -work-dir) flag when running the pipeline. Keeping it apart from your analysis directory helps keep final outputs and temporary data organized.

1. Load Nextflow Module

Ensure that Nextflow is available in your environment:

module load Nextflow
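
To confirm that the module loaded and to see which versions are installed, you can run the following (the exact versions listed will vary):

module spider Nextflow   # list the Nextflow versions available on the HPCC
nextflow -version        # confirm the nextflow command is on your PATH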

2. Create an Analysis Directory

Set up a dedicated directory for your analysis (referred to as the Analysis Directory):

mkdir $HOME/atacseq_project
cd $HOME/atacseq_project

3. Prepare Sample Sheet

Create a sample sheet (samplesheet.csv) with the following format:

sample,fastq_1,fastq_2,replicate
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz,1
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz,1

Ensure all paths to the FASTQ files are correct.
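
As a quick sanity check before submitting (a minimal sketch, assuming the column layout above), you can confirm that every FASTQ file listed in the sheet actually exists:

awk -F',' 'NR > 1 {print $2; if ($3 != "") print $3}' samplesheet.csv | while read -r f; do
    [ -e "$f" ] || echo "Missing file: $f"
done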

4. Configure ICER Environment

Create a nextflow.config file to run the pipeline with SLURM:

process {
    executor = 'slurm'
}
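
The minimal configuration above is usually sufficient. If you need finer control, standard Nextflow settings can be added to the same file; the values below are illustrative examples, not ICER requirements:

process {
    executor = 'slurm'
    // Pass extra options to every SLURM job, e.g. a node constraint or account
    // clusterOptions = '--constraint=intel18'
}

executor {
    // Limit how many SLURM jobs Nextflow keeps submitted at once
    queueSize = 50
}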

5. Create Bash Script

Create a submit_atacseq_job.sh file. You can copy and paste the script below, but note that you will need to modify --outdir, --fasta, and --gtf to match your own output and reference genome paths. This is a typical shell script for submitting an nf-core/atacseq run to SLURM:

#!/bin/bash --login

#SBATCH --job-name=atacseq_job
#SBATCH --time=24:00:00
#SBATCH --mem=24GB
#SBATCH --cpus-per-task=8

cd $HOME/atacseq_project
module load Nextflow/24.04.2

nextflow pull nf-core/atacseq
nextflow run nf-core/atacseq -r 2.1.2 \
    --read_length 150 \
    --input ./samplesheet.csv \
    -profile singularity \
    --outdir ./atacseq_results \
    --fasta ./Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
    --gtf ./Homo_sapiens.GRCh38.108.gtf.gz \
    -work-dir $SCRATCH/atacseq_work \
    -c ./nextflow.config
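
If a run stops partway (for example, the job reaches its time limit), you can resubmit the same script after appending Nextflow's -resume flag to the run command; cached results in the working directory are then reused instead of recomputed:

nextflow run nf-core/atacseq -r 2.1.2 -resume ...   # keep the remaining options from the script unchanged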

6. Run Bash Script with SLURM

In the terminal:

sbatch submit_atacseq_job.sh

7. Monitor and Manage the Run
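
After submitting, you can track both the SLURM job and the pipeline itself. A few commonly used commands (a minimal sketch; paths assume the analysis directory from step 2):

squeue -u $USER                               # show your queued and running SLURM jobs
tail -f $HOME/atacseq_project/.nextflow.log   # follow the Nextflow log written in the launch directory
ls $HOME/atacseq_project/atacseq_results      # pipeline outputs appear here as steps complete

SLURM also writes a slurm-<jobid>.out file in the directory you submitted from, which captures the Nextflow console output.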

Note on Reference Genomes

Common reference genomes can be found in the /mnt/research/common-data/Bio/ folder on the HPCC. Guidance on locating reference genomes on the HPCC or downloading them from Ensembl is available in this GitHub repository.
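
As one example, the reference files used in the submission script above can be downloaded into the analysis directory from Ensembl release 108 (a sketch; verify the URLs and release number on the Ensembl site before use):

wget http://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz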

Alternatively, the pipeline can be launched with a single command that uses the --genome shortcut (an iGenomes key) in place of explicit --fasta and --gtf paths. This example includes the -w flag to place the working directory for intermediate files in the user's scratch space; for consistency with the script above, you may also want to add the same -r and --read_length options:

nextflow run nf-core/atacseq -profile singularity --input samplesheet.csv --genome GRCh38 -c nextflow.config -w $SCRATCH/atacseq_project

Best Practices

Keep the Nextflow working directory on scratch and the analysis directory in your home or research space, pin the pipeline version with -r so runs are reproducible, and free up scratch space by removing work files once results have been verified.
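
For example, the cached work files from a finished run can be removed from the analysis directory (a minimal sketch):

nextflow clean -f                # delete work files of recent runs recorded in this launch directory
rm -rf $SCRATCH/atacseq_work     # or remove the working directory itself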

Getting Help

If you encounter any issues or have questions while running nf-core/atacseq on the HPCC, consider the following resources: the nf-core/atacseq documentation and parameter reference at nf-co.re/atacseq, the nf-core community Slack for pipeline-specific questions, and ICER support for cluster-specific questions such as module availability, storage quotas, or SLURM errors.

Conclusion

Running nf-core/atacseq on the MSU HPCC is streamlined with the Singularity and Nextflow modules. This setup supports reproducible, efficient, and large-scale ATAC-seq analyses. By following this guide, you can take full advantage of the HPCC’s computing power for your bioinformatics projects.

November 03, 2024   John Vusich, Leah Terrian