
A brief update on resources required for long read assembly

Canu v1.6 resources required for 1.4Gb genome assembly

I am currently running an assembly at ~60X coverage, based on RSII and Sequel long read data.

The raw data is first converted to fasta and quiver/arrow files using dextract (Dazzler suite). The quiver/arrow files carry per-base quality information (similar in spirit to fastq), but I am not sure whether they can be used as input to the Canu or MARVEL assemblers, so I used the .fasta files for assembly with Canu.
Steps in running Canu

  • Submit the initial Canu script to the grid environment. This step runs Meryl, which performs a k-mer analysis and outputs histograms of read length and coverage. It also auto-submits batch jobs to the grid for the correction stage.
  • After correction, Canu auto-submits a trimming batch job. This stage takes many resources and time!!!

For the current job there are 917 jobs within the array (these can be run in parallel). Each job requires ~4 CPUs, up to 16 GB RAM and 60–70 hours of walltime. The resources are difficult to schedule accurately because the first 10 or so jobs are often not resource hungry and take only an hour or so; the resources required then scale up dramatically, and the last two hundred jobs can take as long as 80+ hours each.

  • After trimming, Canu auto-submits a unitigging batch job. This stage also takes many resources and time!!! Same resources as the correction stage.
  • Output a fasta assembly and statistics on the assembly.

If all goes well Canu will run through all these processes without interruption. However, in my experience the resources need constant monitoring. There is also a bug in the shell script that it auto-submits which means that if one job in the array fails, it cannot pick up from where it stopped, so individual jobs have to be re-run manually.
So to run the assembly on Canu you ideally need:
(900 × 4) = 3600 dedicated CPUs
(900 × 16) = 14,400 GB memory
With these UNREALISTIC resources you could have a genome in a walltime of about two and a half weeks. Realistically, you will need to run jobs within the array in batches according to the resources available to you. To be clear, Canu does a good job of the assembly and I applaud the authors of this software!! It is clearly a very difficult problem to optimise the assembly process for the compute resources we have. I hope that teams are working on this very difficult issue, or perhaps we have to hope that quantum computing is just around the corner…
The first assemblies I ran with Canu, with half the coverage (30X), took approximately 2 months. Those assemblies auto-submitted batch jobs of around 300 for the trimming and unitigging stages.
The following grid options more or less worked for the previous assemblies. The walltime needed some adjustment as the array job progressed.
gridOptionsCORMHAP="-l walltime=16:00:00" \
gridOptionsCOROVL="-l walltime=16:00:00" \
gridOptionsCOR="-l walltime=16:00:00" \
gridOptionsOBTOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \
gridOptionsUTGOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \
gridOptionsMERYL="-l walltime=4:00:00"
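
For context, here is a rough sketch of how these grid options sit inside a full Canu command. The assembly prefix, output directory and input fasta name are placeholders, and you may also need a plain gridOptions entry for things like your project code:

# sketch only: adjust the prefix, directory, input file and walltimes to your own data and cluster
canu -p my_assembly -d canu_assembly \
  genomeSize=1.4g \
  -pacbio-raw all_subreads.fasta \
  gridOptionsCORMHAP="-l walltime=16:00:00" \
  gridOptionsCOROVL="-l walltime=16:00:00" \
  gridOptionsCOR="-l walltime=16:00:00" \
  gridOptionsOBTOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \
  gridOptionsUTGOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \
  gridOptionsMERYL="-l walltime=4:00:00"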
PS. The MARVEL assembler is not looking very promising at this stage, although the authors claim to reduce resource requirements by an order of magnitude by soft-masking prior to the local alignment step. I may be able to make some adjustments to my script to help, but currently, after 336 hours of walltime with 8 CPUs and 70 GB RAM, the daligner stage has only produced alignments (.las) for 29 out of 363 files.

False starts with PacBio reads

As mentioned previously, the raw reads from the PacBio RSII and Sequel machines come as bax.h5 and subreads.bam files respectively. Both the Falcon and Canu assemblers take fasta or fastq input files.

After reading a bit more we realised that we needed to extract the fasta files (for assembly) as well as the arrow files (for polishing). dextract from the Dazzler suite is the software that extracts and makes these files from the bax.h5 and subreads.bam files.

The following is what happened when we attempted to install dextract via FALCON-integrate on the HPC.

We found that DEXTRACTOR was not on the HPC. We thought it might have been installed as part of FALCON, which is there, but this was not the case. We then decided to install FALCON-integrate, which includes DEXTRACTOR, following these instructions:
export GIT_SYM_CACHE_DIR=~/.git-sym-cache # to speed things up
git clone https://github.com/PacificBiosciences/FALCON-integrate.git # clone first if you don't already have the repo
cd FALCON-integrate
git checkout master  # or whatever version you want
git submodule update --init # Note: You must do this yourself! No longer via `make init`.
make init
source env.sh
make config-edit-user
make -j all
make test  # to run a simple one
We needed to run "module load python" to get the correct python version (2.7.9).
We found that dextract did not build. This was fixed by changing the hdf5 include and lib paths in the makefile (DEXTRACTOR/Makefile).
We then found that FALCON-integrate doesn't include the later version of DEXTRACTOR that can work with .bam files, so we decided to do a stand-alone install of the latest DEXTRACTOR.
git clone https://github.com/thegenemyers/DEXTRACTOR.git # clone the latest DEXTRACTOR if you don't already have it
cd DEXTRACTOR
git checkout master
make -f Makefile
The Makefile needed to be edited to work with our zlib and hdf5 include and lib paths (we possibly could have avoided this by running "module load hdf5/1.8.16" and "module load zlib" beforehand).
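If you need to find the paths to point the Makefile at, something like the following may help. The module names are the ones mentioned above; the install prefix reported will be specific to your cluster:

module load hdf5/1.8.16
module load zlib
module show hdf5/1.8.16   # prints the paths the module sets, including the install prefix
# point the -I (include) and -L (lib) flags in DEXTRACTOR/Makefile at that prefix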
Running dextract
Before running dextract you need to load these modules:
module load hdf5/1.8.16
module load zlib
Example:
~/bin/DEXTRACTOR/dextract -f -a m54078_170627_071131.subreads.bam
generates a m54078_170627_071131.fasta and a m54078_170627_071131.arrow in the current directory
-f outputs a fasta file
-a outputs an arrow file
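
If you have several movies to extract, a small loop saves typing. This sketch uses the same flags as above and assumes all the .subreads.bam files are in the current directory:

module load hdf5/1.8.16
module load zlib
# run dextract over every Sequel subreads.bam in the current directory
for bam in *.subreads.bam; do
    ~/bin/DEXTRACTOR/dextract -f -a "$bam"
done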

Starting out with your raw reads

You have your data – now what?

Your data will usually come to you from the sequencing facility as masses of short reads in zipped files. The files will be labelled something like this:

AP0_CGATGT_L001_R1_001.fastq.gz

AP0_CGATGT_L002_R1_001.fastq.gz

AP0_CGATGT_L001_R2_001.fastq.gz

AP0_CGATGT_L002_R2_001.fastq.gz

So for this sample, AP0 (plant labelled AP, sampled pre-inoculation, hence the '0'), I have paired-end data. R1 is the forward read and R2 is the reverse read. The Illumina run used two lanes, so I have two R1 and two R2 datasets (L001 and L002).

These zipped files are .fastq files. Fasta files are the basic text files that gene sequences are formatted in, with headers beginning with a '>' followed by the sequence on the next line. If they are not too massive they can be opened in Notepad++. Fastq files carry additional information about the quality of each read, which is not easily read by humans but allows software to sort through and discard poor reads. More information here:

https://en.wikipedia.org/wiki/FASTQ_format.
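
To get a feel for the format you can peek at the first record of a read file without unzipping it. The output shown in the comments is illustrative only; your headers and quality strings will differ:

zcat AP0_CGATGT_L001_R1_001.fastq.gz | head -4
# @M01234:56:000000000-ABCDE:1:1101:15589:1332 1:N:0:CGATGT   <- read header
# CTTGTACACACCGCCCGTCGCTACTACCGATTG...                        <- sequence
# +                                                           <- separator line
# FFFFF,FFFFF:FFFFFFFFFFFFFFFFFFFFF...                        <- per-base quality scores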

You want all the sample AP0 R1 reads from the two sequencing lanes to be in one file, and the same goes for the R2 reads. So you will need to unzip the files and then combine them, as explained below (in Back to the zipped raw data).

Unix and high performance computing

Most of the work involved in RNAseq analysis requires a Unix environment (command line) rather than Windows or a graphical user interface. To run assembly and alignment software you also really need access to high performance computing (HPC). There are a number of ways to get access, such as:

http://nci.org.au/access/getting-access-to-the-national-facility/allocation-schemes/

https://aws.amazon.com/hpc/resources/

https://nectar.org.au/

but as I have access to the university cluster I will just be discussing the methods I have used on Artemis, the University of Sydney HPC.

http://sydney.edu.au/research_support/hpc/access/

On my local Windows operating system I have installed PuTTY (http://www.putty.org/), which allows me to log in to the HPC remotely over a VPN and run software on my data in the project directory allocated to me. I use FileZilla (https://filezilla-project.org/) to transfer much of my data from my local storage to the HPC and back.
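
If you prefer the command line for transfers, scp does the same job as FileZilla. The hostname and destination path below are placeholders for your own cluster login node and project directory:

# copy a zipped read file from your local machine to the cluster (placeholders throughout)
scp AP0_CGATGT_L001_R1_001.fastq.gz username@<hpc-login-node>:/project/<your-project>/raw_reads/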

PBS scripts

When using HPC clusters all jobs need to be scheduled and queued so that resources are allocated appropriately. To run software on your data you submit a PBS (Portable Batch System) script, which specifies the software you need to use, the resources you need and the time you expect the job to run (walltime). It is often a bit of guesswork to know how long a job will take, but the software manuals, always available online, will give some guidance. Here is an example of the information required in a PBS script for Artemis:

# Some useful brief comment on what this script does
#PBS -P RDS-FAC-PROJECT-RW - replace FAC and PROJECT to suit your own project
#PBS -N Job_101 - job name; this is what you will see when you run commands to check the status of your job
#PBS -l nodes=1:ppn=2 - requests 1 node and 2 processors per node. Only request more if your job multi-threads. One node has a maximum of 24 processors, so if you wanted 40 cores you would specify 'nodes=2:ppn=20'
#PBS -l walltime=02:00:00 - maximum run time of your job, in hours:minutes:seconds. This example requests 2 hours. The job is killed when it reaches walltime, so make sure you request enough.
#PBS -l pmem=4gb - RAM per processor
#PBS -j oe - send the 'standard error' and 'standard output' log files to one file (delete this line to keep them separate)
#PBS -M your.email@sydney.edu.au - email job reports to this address
#PBS -m abe - along with the above directive, this asks for an email when the job aborts (a), begins (b) and ends (e). 'e' is the most important, as the end-of-job email reports the resources used, which helps you decide how much to request for the next run.

# Load modules
module load trinity - the job will use the software Trinity
module load trimmomatic - the job will use the software Trimmomatic

# Working directory: if all your input and output files are in one directory, it is useful to have a 'cd' command here so you can run the script from any directory without specifying full pathnames for the input and output files.

# Run trinity: insert the commands that actually run the job…
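
Putting those pieces together, here is a minimal sketch of a complete script. The project code, job name, module names, resources and file names are all placeholders to adjust; cd "$PBS_O_WORKDIR" changes into the directory the job was submitted from, which is a common PBS convention:

#!/bin/bash
# Sketch of a PBS script that runs Trinity on one paired-end sample
# (all names and resource requests below are placeholders)
#PBS -P RDS-FAC-PROJECT-RW
#PBS -N trinity_AP0
#PBS -l nodes=1:ppn=2
#PBS -l walltime=02:00:00
#PBS -l pmem=4gb
#PBS -j oe
#PBS -M your.email@sydney.edu.au
#PBS -m abe

# Load modules
module load trinity
module load trimmomatic

# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

# Run Trinity on the combined paired-end reads (parameters are illustrative)
Trinity --seqType fq --max_memory 8G --CPU 2 \
        --left AP0_R1.fastq.gz --right AP0_R2.fastq.gz \
        --output trinity_AP0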

I will provide some of the actual pbs scripts I have used as I go.

Back to the zipped raw data

Once you have your data on the hpc you can unzip using the unix command ‘gunzip’ as these are gzipped files (gz):

gunzip 'filename'

eg. gunzip AP0_CGATGT_L001_R1_001.fastq.gz

Your original .gz file will be gone from the directory and replaced with the unzipped version.

With your files unzipped you can now join them with the unix command cat and, in my example, rename the new combined file AP0_R1.fastq:

cat AP0_CGATGT_L001_R1_001.fastq AP0_CGATGT_L002_R1_001.fastq > AP0_R1.fastq
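
To do the same thing for both read directions in one go, a small loop works. This is a sketch based on the example file names above; adjust the pattern to your own sample and barcode names:

# combine the two lanes for R1 and R2 of sample AP0
for R in R1 R2; do
    cat AP0_CGATGT_L001_${R}_001.fastq AP0_CGATGT_L002_${R}_001.fastq > AP0_${R}.fastq
done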

When you have all your data combined like this you can zip them again and begin processing them. Much of the software for assembly can process zipped files. To zip them:

gzip 'filename'

Next post – trimming the reads