The main confusing bit is the name ‘bam’. Here we start with unaligned raw bams (subreads.bam). Then, after aligning with pbalign, we have aligned bams. These aligned reads are the ones to use for polishing with arrow from SMRT Tools.
After nearly two and a half months of tweaking and restarting, we have an assembled genome. It came in at 1.37 Gb of assembled bases, 19,771 contigs (a bit high I guess, but with a coverage of around 21X after trimming and correction not too bad) and an NG50 of 134,701.
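For anyone unfamiliar with the statistic: NG50 is like N50 but uses half the *estimated genome size* rather than half the total assembly length, so it is comparable between assemblies of different completeness. A minimal sketch with made-up contig lengths (not our real contigs):

```python
def ng50(contig_lengths, genome_size):
    # Walk contigs longest-first; NG50 is the length of the contig at which
    # the running sum crosses half the estimated genome size.
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return None  # assembly covers less than half the genome estimate

# Toy example: 500 + 400 = 900 >= 600, so NG50 is 400
lengths = [500, 400, 300, 200, 100]
print(ng50(lengths, genome_size=1200))  # → 400
```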
BUSCO analysis results were also not too bad for a first assembly, using a lineage set of 1335 conserved genes.
1024 Complete BUSCOs (C)
640 Complete and single-copy BUSCOs (S)
384 Complete and duplicated BUSCOs (D)
101 Fragmented BUSCOs (F)
210 Missing BUSCOs (M)
1335 Total BUSCO groups searched
BUSCO analysis done. Total running time: 5979.202003479004 seconds
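Expressed as percentages of the 1335 groups searched, the counts above can be sanity-checked quickly (single + duplicated should equal complete, and complete + fragmented + missing should equal the total):

```python
# BUSCO counts as reported above
counts = {"C": 1024, "S": 640, "D": 384, "F": 101, "M": 210}
total = 1335

# Internal consistency checks on the reported numbers
assert counts["S"] + counts["D"] == counts["C"]
assert counts["C"] + counts["F"] + counts["M"] == total

for key, value in counts.items():
    print(f"{key}: {value} ({100 * value / total:.1f}%)")
# First line → C: 1024 (76.7%)
```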
I think that the duplicated genes, and maybe the fragmented ones too, are due to heterozygosity. This will have inflated the genome size too. With revised parameters for assembly – to phase the haplotypes – we might get a more realistic assembly.
More work to do, but I have heard a few negative comments about Canu and do not think they are warranted. Time for the assembly was approximately 3 weeks once we ironed out the grid options for the different stages.
The seven bridges of Königsberg problem, and Euler’s solution (1736), lie at the root of the assembly algorithms for short reads. De Bruijn graph theory developed from Euler’s insight of treating every land mass as a ‘node’ and every bridge as an ‘edge’.
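A toy sketch of the idea, with made-up reads: each k-mer becomes an edge between its prefix and suffix (k−1)-mers, and a contig is read off an Eulerian path through the graph (here via Hierholzer's algorithm). This is an illustration only, not real assembler code:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    # Unique k-mers become edges: prefix (k-1)-mer -> suffix (k-1)-mer
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    # Hierholzer's algorithm; assumes a path exists (true for the toy input)
    graph = {node: list(edges) for node, edges in graph.items()}
    indeg = defaultdict(int)
    for edges in graph.values():
        for target in edges:
            indeg[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any
    start = next((n for n in graph if len(graph[n]) - indeg[n] == 1),
                 next(iter(graph)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph.get(node):
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

reads = ["ACGTG", "GTGGA"]
path = eulerian_path(de_bruijn(reads, k=3))
# Contig = first node plus the last character of each subsequent node
print(path[0] + "".join(node[-1] for node in path[1:]))  # → ACGTGGA
```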
Tweaking the memory and wall-time for the compute-cluster Mhap job (Canu’s correction and overlapping algorithm) progressed things slightly – but the job still failed inexplicably.
After many trials it appears that the project directory did not have the disk capacity for the very large (several terabytes) intermediate files created by Mhap. It seems a pity that neither the HPC nor the software alerted us to this issue – hence many wasted attempts and several weeks of trying.
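A simple pre-flight check along these lines (a hypothetical helper, not part of Canu or the HPC tooling) would have saved us those weeks:

```python
import shutil

def check_free_space(path, required_tb):
    # Fail fast if the working directory cannot hold the expected
    # intermediate files (several TB in Mhap's case).
    free_tb = shutil.disk_usage(path).free / 1024**4
    if free_tb < required_tb:
        raise RuntimeError(
            f"{path}: only {free_tb:.2f} TB free, need {required_tb} TB"
        )
    return free_tb

# e.g. before launching the job:
# check_free_space("/path/to/workdir", required_tb=5)
```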
We then ran the job within the scratch directory, which has around 200 TB of disk space available, and finally the Mhap job completed successfully and the next stage began. This is Canu’s other correction and trimming algorithm, Ovl (which returns alignments). We are now trying to work out why this stage only partially completes. We have upped the memory for this stage to 10 GB from the default 5 GB and it is now running – so I guess we will see in a few days if it worked.
As mentioned previously, the raw reads from the PacBio RSII and Sequel come as bax.h5 and subreads.bam respectively. Both the Falcon and Canu assemblers take Fasta or Fastq input files.
After reading a bit more we realized that we needed to extract the fasta files, as well as the arrow files, for assembly and then polishing. Dextract, from the Dazzler suite, is the software that extracts and makes these files from bax.h5 and subreads.bam.
What follows is what happened when we attempted to install dextract via FALCON-integrate on the HPC.
We found DEXTRACTOR was not on the HPC – we thought it might have been installed as part of FALCON, which is there, but this was not the case. We then decided to install FALCON-integrate, which includes DEXTRACTOR. We followed the instructions here: