Overview

The MSU HPCC, managed by ICER, provides a scalable environment for running complex bioinformatics analyses. This tutorial walks through running the nf-core/atacseq pipeline on the HPCC, with a focus on reproducibility and efficient use of cluster resources.

Key Benefits of nf-core/atacseq

nf-core/atacseq is designed for:

Prerequisites

Step-by-Step Tutorial

Note on Directory Variables

On the MSU HPCC:

Note on Working Directory

The working directory, which stores intermediate and temporary files, can be specified separately using the -w flag when running the pipeline. This helps keep your analysis outputs and temporary data organized.
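
As a minimal sketch of this idea, the helper below picks a working directory on scratch when the SCRATCH variable is set and otherwise falls back to ./work, which is Nextflow's default. The function name resolve_work_dir and the directory name atacseq_work are illustrative assumptions, not part of the pipeline.

```shell
# Sketch: choose a Nextflow working directory, preferring scratch space.
# resolve_work_dir is a hypothetical helper name used only for illustration.
resolve_work_dir() {
    project="$1"
    if [ -n "${SCRATCH:-}" ]; then
        # Scratch is available: keep intermediate files out of the analysis directory.
        echo "$SCRATCH/$project"
    else
        # No scratch variable: fall back to Nextflow's default location.
        echo "$PWD/work"
    fi
}
```

The result can then be passed to the pipeline, for example with -w "$(resolve_work_dir atacseq_work)".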

1. Load Nextflow Module

Ensure Nextflow is available in your environment:

module load Nextflow

2. Create an Analysis Directory

Set up a dedicated directory for your analysis (referred to as the Analysis Directory):

mkdir $HOME/atacseq_project
cd $HOME/atacseq_project

3. Prepare Sample Sheet

Create a sample sheet (samplesheet.csv) with the following format:

sample,fastq_1,fastq_2,replicate
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz,1
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz,1

Ensure all paths to the FASTQ files are correct.
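
One way to check the paths before launching is a small shell function like the sketch below. It assumes the four-column layout shown above and that no field contains an embedded comma; the function name check_samplesheet is hypothetical.

```shell
# Sketch: report FASTQ paths in a sample sheet that do not exist on disk.
# Assumes the columns sample,fastq_1,fastq_2,replicate with no quoted commas.
check_samplesheet() {
    sheet="$1"
    missing=0
    {
        read -r _header   # skip the column-name row
        # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
        while IFS=, read -r sample fq1 fq2 replicate || [ -n "$sample" ]; do
            for fq in "$fq1" "$fq2"; do
                # An empty field (for example, no second read file) is skipped.
                [ -z "$fq" ] && continue
                if [ ! -f "$fq" ]; then
                    echo "MISSING: $sample -> $fq"
                    missing=$((missing + 1))
                fi
            done
        done
    } < "$sheet"
    echo "$missing missing file(s)"
    # Return success only when every listed file was found.
    [ "$missing" -eq 0 ]
}
```

Running check_samplesheet samplesheet.csv from the Analysis Directory prints one line per missing file and returns a non-zero status if any are missing.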

4. Configure ICER Environment

Create an icer.config file to run the pipeline with SLURM:

process {
    executor = 'slurm'
}

5. Run nf-core/atacseq

Example SLURM Job Submission Script

Below is a typical shell script for submitting an nf-core/atacseq job to SLURM:

#!/bin/bash

#SBATCH --job-name=atacseq_job
#SBATCH --time=24:00:00
#SBATCH --mem=24GB
#SBATCH --cpus-per-task=8

cd $HOME/atacseq_project
module load Nextflow/23.10.0

nextflow pull nf-core/atacseq
nextflow run nf-core/atacseq -r 2.1.2 \
    --input ./samplesheet.csv \
    -profile singularity \
    --outdir ./atacseq_results \
    --fasta ./Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
    --gtf ./Homo_sapiens.GRCh38.108.gtf.gz \
    -work-dir $SCRATCH/atacseq_work \
    -c ./icer.config
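
A script like the one above is submitted with sbatch and can be followed with squeue. The sketch below wraps those two commands in a helper; the function name submit_and_watch and the script filename run_atacseq.sh are assumptions made for illustration.

```shell
# Sketch: submit a SLURM job script and show its queue entry.
# submit_and_watch is a hypothetical helper name.
submit_and_watch() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "No such job script: $script" >&2
        return 1
    fi
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on a node with SLURM available" >&2
        return 2
    fi
    # --parsable makes sbatch print only the job ID (optionally ";cluster").
    jobid=$(sbatch --parsable "$script") || return 3
    jobid=${jobid%%;*}
    echo "Submitted job $jobid"
    # Show the job's current state in the queue.
    squeue -j "$jobid"
}
```

For example, if the script is saved as run_atacseq.sh in the Analysis Directory, submit_and_watch run_atacseq.sh submits it and prints the job's queue status.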

Note on Reference Genomes

Common reference genomes can be found in the research common-data space on the HPCC. Refer to the README file in that directory for more details. Additionally, you can find guidance on downloading reference genomes from Ensembl in this GitHub repository.

Execute the pipeline with the following command. This example includes a -w flag to specify a working directory in the user’s scratch space for intermediate files:

nextflow run nf-core/atacseq -profile singularity --input samplesheet.csv --outdir ./atacseq_results --genome GRCh38 -c icer.config -w $SCRATCH/atacseq_project

6. Monitor and Manage the Run

Best Practices

Getting Help

If you encounter any issues or have questions while running nf-core/atacseq on the HPCC, consider the following resources:

Conclusion

Running nf-core/atacseq on the MSU HPCC is streamlined by the Nextflow module and the Singularity profile. This setup supports reproducible, efficient, and large-scale ATAC-seq analyses. By following this guide, you can take full advantage of the HPCC’s computing power for your bioinformatics projects.

November 03, 2024   John Vusich