A brief update on the resources required for long-read assembly

Canu v1.6: resources required for a 1.4 Gb genome assembly

I am currently running an assembly with ~60X coverage, based on PacBio RS II and Sequel long-read data.

The raw data are first converted to fasta plus quiver/arrow files using dextract (from the Dazzler suite). The quiver output is a type of fastq, but I am not sure whether it can be used as input to the Canu or MARVEL assemblers, so I used the .fasta files for the assembly with Canu.
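For reference, a Canu launch on those fasta files looks something like the sketch below. This is a minimal illustration rather than my exact command: the prefix, working directory and read path are placeholders.

# Sketch of a Canu v1.6 launch (placeholder names throughout)
# -p / -d     : assembly prefix and working/output directory
# genomeSize  : expected genome size, used by Canu for coverage estimates
# useGrid     : let Canu auto-submit its batch jobs to the scheduler
canu -p asm -d asm_canu \
    genomeSize=1.4g \
    useGrid=true \
    -pacbio-raw reads/*.fasta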
Steps in running Canu

  • Submit the initial Canu script to the grid environment. This step runs Meryl, which performs a k-mer analysis and outputs histograms of read length and coverage, and it also auto-submits scripts as batch jobs to the grid for the correction stage.
  • After correction, Canu auto-submits a trimming batch job. This stage takes a lot of resources and time!

For the current job there are 917 jobs within the array (these can be run in parallel). Each job requires ~4 CPUs, up to 16 GB of RAM and 60-70 hours of walltime. The resources are difficult to schedule accurately because the first ten or so jobs are often not resource hungry and take only an hour or so; the requirements then scale up dramatically, and the last two hundred jobs can take 80+ hours each. (A sketch of the shape of these auto-submitted array jobs follows this list.)

  • After trimming, Canu auto-submits a unitigging batch job. This stage also takes a lot of resources and time, with the same resources as correction.
  • Finally, Canu outputs a fasta assembly and statistics on the assembly.
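Canu generates these batch scripts itself, but for anyone unfamiliar with what actually lands in the queue, the jobs have roughly the shape of the PBS/Torque array job sketched below. This is an illustration built from the per-job figures quoted above, not Canu's actual generated script; the job name and stage script name are assumptions.

#!/bin/bash
#PBS -N canu_trim_ovl            # hypothetical job name for the trimming overlap stage
#PBS -t 1-917                    # one array task per overlap job
#PBS -l nodes=1:ppn=4            # ~4 CPUs per job
#PBS -l mem=16gb                 # up to 16 GB of RAM per job
#PBS -l walltime=70:00:00        # enough for the later, heavier jobs
cd $PBS_O_WORKDIR
# Canu's generated stage script picks up the array index and runs that one overlap job
./overlap.sh $PBS_ARRAYID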

If all goes well, Canu will run through all of these processes without interruption. In my experience, however, the resources need constant monitoring. There is also a bug in the shell scripts it auto-submits which means that if one job in the array fails, Canu cannot pick up from where it stopped, so individual jobs have to be re-run manually.
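When that happens, I resubmit just the failed indices by hand. The loop below is a rough sketch of how this can be done on a Torque/PBS grid; the stage directory, script name and job indices are assumptions for illustration and will differ between runs and Canu versions.

# Re-run selected failed tasks of, e.g., the trimming overlap array
STAGE_DIR=asm_canu/trimming/1-overlapper     # assumed path inside the Canu working directory
for i in 530 611 874; do                     # indices of the jobs that failed
    # the generated stage script expects to be run with the array index it missed
    qsub -t $i -l walltime=80:00:00 "$STAGE_DIR/overlap.sh"
done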
So to run the assembly with Canu you ideally need:
900 × 4 = 3,600 dedicated CPUs and 900 × 16 = 14,400 GB of memory.
With these UNREALISTIC resources you could have a genome in a walltime of around two and a half weeks. Realistically, you will need to run the jobs within the array in batches according to the resources available to you. To be clear, Canu does a good job of the assembly and I applaud the authors of this software! Optimising the assembly process for the compute resources we have is clearly a very difficult problem. I hope that teams are working on this very difficult issue, or perhaps we have to hope that quantum computing is just around the corner…
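For a rough sense of what "realistically" means, the arithmetic looks like this. The per-job figures are from this run; the 200 concurrent jobs is just an assumed allocation for illustration.

# Back-of-the-envelope scheduling sums for the 900-job array
jobs=900; cpus=4; mem=16; hours=70                  # per-job figures quoted above
echo "peak CPUs:   $(( jobs * cpus ))"              # 3600
echo "peak memory: $(( jobs * mem )) GB"            # 14400
concurrent=200                                      # assumed: what the cluster will actually let us run at once
waves=$(( (jobs + concurrent - 1) / concurrent ))   # number of batches, rounded up
echo "stage walltime: ~$(( waves * hours )) hours"  # ~350 h for this one stage alone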
The first assemblies I ran with Canu, at half the coverage (30X), took approximately two months. Those assemblies auto-submitted batch arrays of around 300 jobs for the trimming and unitigging stages.
The following grid options more or less worked for the previous assemblies. The walltime needed some adjustment as the array job progressed.
gridOptionsCORMHAP="-l walltime=16:00:00" \
gridOptionsCOROVL="-l walltime=16:00:00" \
gridOptionsCOR="-l walltime=16:00:00" \
gridOptionsOBTOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \
gridOptionsUTGOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \
gridOptionsMERYL="-l walltime=4:00:00"
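One way to adjust the walltime as an array progresses, without killing and resubmitting everything, is to modify the still-queued tasks in place with qalter. The job ID below is a placeholder, and whether qalter is permitted on a submitted array depends on the scheduler and site policy.

# Raise the walltime on the queued tasks of array job 123456[] (placeholder ID)
qalter -l walltime=80:00:00 123456[]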
PS. The MARVEL assembler is not looking very promising at this stage, although its authors claim to reduce the resource requirement by an order of magnitude by soft-masking prior to the local alignment step. I may be able to make some adjustments to my script to help, but currently, after 336 hours of walltime on 8 CPUs and 70 GB of RAM, the daligner stage has produced alignments (.las files) for only 29 out of 363 files.
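For what it is worth, I track daligner's progress with a quick count of the finished alignment files; the path below is a placeholder for the MARVEL working directory.

# How many of the expected 363 .las files exist so far?
ls marvel_run/*.las 2>/dev/null | wc -l      # currently 29 of 363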