After struggling with Canu version 1.6 to assemble a large (1 Gbp), highly repetitive fungal genome in 2018 (https://doi.org/10.1101/2020.03.18.996108; see the earlier posts on PBS-Pro grid options), I was expecting a similar process when I began a new fungal assembly recently.
I decided to test the latest version (2.0) with an 800 Mbp genome and 200X PacBio coverage. Using the same script that was used to assemble a dihaploid genome for myrtle rust (which took 4.5 months), this latest assembly took just over a week! I am very excited to have such a quick turnaround.
The output looks promising, with good BUSCO results of over 94% complete gene models. Apart from the Canu version and the species difference, another variable was the coverage available; previously we had 70X starting coverage. With this success I decided to re-run the myrtle rust genome assembly, and this time it took a week and the statistics are very similar to the original output.
I did not have to go into the shell scripts to resubmit jobs, as I had to for the first assembly. This time it ran relatively cleanly from beginning to end.
Using Canu v1.6, this stage has taken 2 months so far, with 351 jobs. There are 28 jobs still to run on the University of Sydney HPC, Artemis. Each job requires around 10 GB of memory, 4 CPUs, and a progressively larger walltime, starting from only about 5 hours and rising to about 150 hours for the last 100 jobs.
I have detailed this previously, but the genome is predicted to be 1.3–1.4 Gbp and we started with 50–60X coverage of PacBio Sequel and RSII data.
The trimmedReads.fasta.gz output was 12 GB.
canu script (NB the genome size was set to 1G for the script):
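What follows is a minimal sketch of the command's shape rather than the verbatim script; the assembly prefix, directory, PBS project and read paths are placeholders:

canu -p asm -d asm_run \
    genomeSize=1g \
    useGrid=true \
    gridOptions="-P PROJECT -q defaultQ" \
    gridOptionsCORMHAP="-l walltime=16:00:00" \
    -pacbio-raw pacbio/*.fasta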
Canu v1.6 resources required for a 1.4 Gbp genome assembly
Currently running an assembly with ~60X coverage based on the RSII and Sequel long-read data.
Raw data is first converted to .fasta and .quiva/.arrow files using dextract (Dazzler suite). The .quiva format is a type of fastq, but I am not sure whether it can be used as input to the Canu or Marvel assemblers. I used the .fasta files for assembly with Canu.
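A minimal sketch of the extraction step (I am assuming the -f/-q/-a flags for fasta, quiva and arrow output, going by the DEXTRACTOR documentation; the input name is a placeholder):

# -f writes .fasta, -q writes .quiva, -a writes .arrow
dextract -f -q -a movie.subreads.bam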
Steps in running Canu
Submit the initial canu script to the grid environment (a sketch of a PBS-Pro wrapper is shown after this list). This step runs Meryl, which performs a k-mer analysis and outputs histograms of read length and coverage. It also auto-submits scripts for batch jobs to the grid for the correction stage.
After correction, canu auto-submits a trimming batch job. This stage takes many resources and much time!!!
- For the current job: 917 jobs within the array (these can be run in parallel). Each job requires ~4 CPUs, up to 16 GB RAM, and 60–70 hours of walltime. The resources are difficult to schedule accurately because the first 10 or so jobs are often not resource hungry and will only take an hour or so; the resources required then scale up dramatically, and the last two hundred jobs can take as long as 80+ hours each.
After trimming, canu auto-submits a unitigging batch job. This stage also takes many resources and time!!! Same resources as correction.
Output is a fasta assembly and statistics on the assembly.
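For step 1, this is a minimal sketch of the kind of PBS-Pro wrapper used to submit the initial canu run (project name, module version and file paths are placeholders; canu then submits its own array jobs):

#!/bin/bash
#PBS -P PROJECT
#PBS -l select=1:ncpus=2:mem=8GB
#PBS -l walltime=24:00:00

cd "$PBS_O_WORKDIR"
module load canu/1.6

# canu runs briefly here, then auto-submits its own grid jobs
canu -p asm -d asm_run genomeSize=1g \
    gridOptions="-P PROJECT" \
    -pacbio-raw pacbio/*.fasta

Submit it once with qsub; after that, canu takes over the job submission.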
If all goes well canu will run through all these processes without interruption. However, my experience is that the resources need constant monitoring. Also, there is a bug in the shell scripts it auto-submits that means if one job in the array fails, it cannot pick up from where it stopped, so individual jobs have to be re-run manually (see the sketch below).
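For example, if one overlap job in an array dies, the failed index can be re-run by hand from the stage directory. The script name and index here are illustrative; they differ by stage and Canu version:

# e.g. job 451 of the trimming overlap array failed
cd asm_run/trimming/1-overlapper
./overlap.sh 451   # the generated scripts take the array index as an argument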
So to run the assembly on Canu you ideally need:
(900 × 4) = 3,600 dedicated CPUs
(900 × 16) = 14,400 GB memory
With these UNREALISTIC resources you could have a genome in a walltime of two and a half weeks. Realistically, you will need to run jobs within the array in batches according to the resources available to you. To be clear, Canu does a good job of the assembly and I applaud the authors of this software!! It is clearly a very difficult problem to optimise the assembly process for the compute resources we have. I hope that teams are working on this very difficult issue, or perhaps we have to hope that quantum computing is just around the corner…
The first assemblies I ran with Canu, with half the coverage (30X), took approximately 2 months. These assemblies auto-submitted batch jobs of around 300 for the trimming and unitigging stages.
The following grid options more or less worked for the previous assemblies; the walltime needed some adjustment as the array job progressed.
gridOptionsCORMHAP="-l walltime=16:00:00" \
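When an array needed more walltime, I could raise the limit and re-issue the same canu command; canu resumes from the completed jobs in the assembly directory. A sketch with placeholder names:

# re-run with a larger walltime; canu picks up where it left off
canu -p asm -d asm_run genomeSize=1g \
    gridOptionsCORMHAP="-l walltime=60:00:00" \
    -pacbio-raw pacbio/*.fasta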
PS. The Marvel assembler is not looking very promising at this stage, although the authors claim to reduce resource requirements by an order of magnitude by soft-masking prior to the local alignment step. I may be able to make some adjustments to my script to help, but currently, after 336 hours of walltime with 8 CPUs and 70 GB RAM, the daligner stage has only produced alignments (.las) for 29 out of 363 files.
Well, Canu seems to do a reasonable job of assembling long reads for larger organisms, but it is laboriously slow!! The trimming and unitigging stages can take several weeks each. I am now using the reduced list of reads (after mapping to the mitochondrial genome, which reduced the reads by ~4%) with the Marvel assembler.
I have a draft mitochondrial genome from my organism, and I now have a lot more PacBio data to assemble. I want to extract the mitochondrial reads before assembling the genomic reads, in the hope that this will make the Canu assembly quicker. One way to do this is sketched below.
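A minimal sketch (the choice of minimap2 as the mapper and the file names are illustrative): map everything to the draft mitochondrial genome and keep only the reads that do not map.

# map PacBio reads to the draft mito genome, keep the unmapped ones
minimap2 -ax map-pb mito_draft.fasta subreads.fasta \
    | samtools fasta -f 4 - > nuclear_reads.fasta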
The main confusing bit is the name 'bam'. Here we have pre-aligned raw bams (subreads.bam). Then, after aligning with pbalign, we have aligned bams. These aligned reads are the ones to use when polishing with arrow from SMRT Tools.
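A minimal sketch of that polishing round (file names are placeholders):

# align the raw subreads to the draft assembly
pbalign movie.subreads.bam assembly.fasta aligned.subreads.bam

# arrow wants the bam and the reference indexed
pbindex aligned.subreads.bam
samtools faidx assembly.fasta

# polish the assembly
arrow aligned.subreads.bam -r assembly.fasta -o polished.fasta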
After nearly two and a half months of tweaking and restarting, we have an assembled genome. It came in at 1.37 Gbp of assembled bases and 19,771 contigs (a bit high I guess, but with coverage of around 21X after trimming and correction, not too bad), with an NG50 of 134,701.
BUSCO analysis results were also not too bad for a first assembly, using a set of 1,335 conserved genes for the taxon (an example invocation is sketched after the summary below).
1024 Complete BUSCOs (C)
640 Complete and single-copy BUSCOs (S)
384 Complete and duplicated BUSCOs (D)
101 Fragmented BUSCOs (F)
210 Missing BUSCOs (M)
1335 Total BUSCO groups searched
BUSCO analysis done. Total running time: 5979.202003479004 seconds
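The run was along these lines (a sketch: I am assuming the basidiomycota_odb9 lineage, which contains 1,335 BUSCO groups; paths and thread count are placeholders):

python run_BUSCO.py -i assembly.fasta -o busco_asm \
    -l basidiomycota_odb9 -m genome -c 8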
I think that the duplicated genes, and maybe the fragmented ones too, are due to heterozygosity. This will have inflated the genome size too. This means that with revised parameters for assembly – to phase the haplotypes – we might get a more realistic assembly.
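For a re-run, the Canu FAQ suggests parameters along these lines for keeping haplotypes separate in heterozygous genomes (from my reading of the FAQ; check the docs for the version you are running):

canu -p asm_phased -d asm_phased genomeSize=1g \
    corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cr 50" \
    -pacbio-raw pacbio/*.fasta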
There is more work to do, but I have heard a few bad comments about Canu and do not think that they are warranted. Time for the assembly was approximately 3 weeks once we ironed out the grid options for the different stages.
The Seven Bridges of Königsberg problem, and Euler's solution (1736), is the root of the assembly algorithms for short reads. De Bruijn graph theory developed from Euler's insight that every land mass is a 'node' and every bridge is an 'edge'.