HPCC Data Handling Guide: Uploading, Storing, and Sharing Sequencing Data

Getting Data on the HPCC

Welcome to the HPCC data handling guide. This documentation is designed to help new users understand the procedures for uploading, storing, and sharing data within the High-Performance Computing Cluster (HPCC) environment. It covers how to bring new sequencing data into the HPCC, access existing data, and share data with collaborators or the public through NCBI and GEO.

1. New Sequencing Data

To upload new sequencing data to the HPCC, use one of the following methods:

FTP (File Transfer Protocol)

Use an FTP client to connect to the HPCC server. Credentials and server details should be provided by your system administrator.
Navigate to your designated directory on the server and upload your sequencing data.
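
The FTP steps above can also be scripted. The following is a minimal sketch using Python's standard ftplib module. It assumes the server accepts FTP over TLS, which may not match your site's setup. The host name, credentials, and directory shown in the usage note are placeholders, not real HPCC values; use the details provided by your system administrator.

```python
import os
from ftplib import FTP_TLS


def connect(host, user, password):
    """Open an FTP-over-TLS session (assumption: the server supports TLS)."""
    ftp = FTP_TLS(host)
    ftp.login(user, password)
    ftp.prot_p()  # encrypt the data channel as well as the control channel
    return ftp


def upload_file(ftp, local_path, remote_dir):
    """Upload one file into remote_dir over an already-open FTP connection."""
    ftp.cwd(remote_dir)  # move into your designated directory
    remote_name = os.path.basename(local_path)
    with open(local_path, "rb") as handle:
        # Binary mode is required for compressed files such as .fastq.gz
        ftp.storbinary(f"STOR {remote_name}", handle)
    return remote_name
```

A call might look like upload_file(connect("ftp.example.org", "netid", "password"), "sample_R1.fastq.gz", "/incoming/mylab"), where every argument is a hypothetical value.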

Hard Drive

If data is on a physical hard drive, connect it to a workstation that has access to the HPCC.
Use secure copy (SCP) or rsync commands to transfer data from the hard drive to the HPCC storage. Ensure data integrity by verifying file checksums post-transfer.
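
One way to verify checksums after a transfer is sketched below with Python's standard hashlib module. The sketch assumes SHA-256 is an acceptable checksum and that both the original file and the copy are readable from the same machine. If they are not, run sha256sum on each side separately and compare the resulting hex strings. The function names are illustrative.

```python
import hashlib


def sha256sum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for multi-gigabyte FASTQ files
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_path, copied_path):
    """Return True when the two files have identical SHA-256 digests."""
    return sha256sum(source_path) == sha256sum(copied_path)
```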

Globus

For large datasets, use Globus, a reliable file transfer service.
Ensure you have access to a Globus endpoint on the HPCC. Transfer data by selecting source and destination endpoints, then initiating the transfer through the Globus web interface.

2. Existing Sequencing Data

nf-core/fetchngs

The nf-core/fetchngs pipeline can be used to fetch sequencing data and its metadata from public repositories directly to the HPCC.
Provide the pipeline with a list of data accession numbers and an output directory to automate the download process.
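
As a rough illustration, the sketch below writes an accession list (one ID per line) and builds a Nextflow command line for nf-core/fetchngs. The accession IDs, file names, and the singularity profile are assumptions made for the example. Check the pipeline documentation and your HPCC setup for the options that apply to your version.

```python
from pathlib import Path


def write_id_list(accessions, path):
    """Write one accession per line, dropping blanks and duplicates."""
    unique = list(dict.fromkeys(a.strip() for a in accessions if a.strip()))
    Path(path).write_text("\n".join(unique) + "\n")
    return unique


def fetchngs_command(id_file, outdir, profile="singularity"):
    """Build the command as an argument list (profile is an assumption)."""
    return [
        "nextflow", "run", "nf-core/fetchngs",
        "-profile", profile,
        "--input", str(id_file),
        "--outdir", str(outdir),
    ]
```

The returned list can be passed to subprocess.run(..., check=True) on a node where Nextflow is available, or joined with spaces and placed in a job script.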

NCBI, GEO, SRA

For data stored in NCBI, GEO, or SRA, use the SRA Toolkit to download the required datasets.
Commands such as fasterq-dump (the faster successor to fastq-dump) retrieve data in FASTQ format, the standard format for raw sequencing reads.
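
A minimal sketch of an SRA Toolkit download, driven from Python, is shown below. It assumes the prefetch and fasterq-dump programs from the SRA Toolkit are on your PATH. The accession and thread count are illustrative values. Running both commands in the same working directory lets fasterq-dump find the files that prefetch downloaded.

```python
import subprocess


def sra_commands(accession, threads=4):
    """Return the two SRA Toolkit commands for one run accession."""
    return [
        ["prefetch", accession],  # download the .sra archive
        ["fasterq-dump", accession, "--threads", str(threads)],  # convert to FASTQ
    ]


def download_fastq(accession, workdir, threads=4):
    """Run both commands inside workdir; raises if either one fails."""
    for command in sra_commands(accession, threads):
        subprocess.run(command, cwd=workdir, check=True)
```

For example, download_fastq("SRR000001", "/path/to/scratch") would leave FASTQ files in that directory; both arguments are placeholders.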

3. Uploading to NCBI and Submitting Data to GEO

Preparing Data for Submission

Ensure that your data is properly annotated and organized. Metadata should include details about the experimental setup, sample information, and sequencing parameters.
Data should be cleaned and validated to meet the submission requirements of NCBI and GEO.
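
A simple pre-submission check can catch incomplete metadata early. The sketch below flags samples with empty or missing fields. The field names in REQUIRED_FIELDS are hypothetical examples only. The actual required columns are defined by the NCBI and GEO submission templates.

```python
# Hypothetical field names; replace with the columns from the official template
REQUIRED_FIELDS = ("sample_name", "organism", "library_strategy", "instrument_model")


def missing_metadata(rows, required=REQUIRED_FIELDS):
    """Map each incomplete sample to the list of fields it is missing."""
    problems = {}
    for index, row in enumerate(rows):
        missing = [f for f in required if not str(row.get(f, "")).strip()]
        if missing:
            # Fall back to the row position when the sample has no name
            name = str(row.get("sample_name", "")).strip() or f"row {index + 1}"
            problems[name] = missing
    return problems
```

The rows argument is a list of dictionaries, such as the output of csv.DictReader applied to a metadata spreadsheet exported as CSV. An empty result means every sample has all of the listed fields filled in.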

NCBI Submission

Create an account on NCBI and access the submission portal.
Follow the guidelines for submitting raw data (FASTQ files) and associated metadata. Be sure to complete all required fields to ensure your submission is comprehensive.

GEO Submission

To submit data to GEO, use the GEO submission portal. GEO accepts both processed data (such as gene expression matrices) and raw data (such as FASTQ files).
Include comprehensive metadata and a detailed description of the study to facilitate data reuse.

Additional Notes

Always back up your data before transferring or submitting it to avoid data loss.
Regularly update your local and HPCC data inventories to keep track of what has been uploaded, processed, or shared.
Submit a ticket to ICER for questions related to data handling and security.

By following these guidelines, you can efficiently manage your sequencing data within the HPCC environment and ensure it is accessible for your research and collaborations.

August 03, 2024 John Vusich, Leah Terrian