All posts by peritob

Later versions of Canu assembler are WAY better.

After struggling with Canu version 1.6 to assemble a highly repetitive and large (1 Gbp) fungal genome in 2018 (doi: – (PBS-Pro grid options), I was expecting a similar process when I began a new fungal assembly recently.

I decided to test the latest version (2.0) with an 800 Mbp genome and 200X PacBio coverage. Using the same script that was used for assembling a dihaploid genome for myrtle rust (which took 4.5 months) this latest assembly took just over a week! Very excited to have such a quick turnaround.

The output looks promising with good busco results of over 94% complete gene models. Apart from the canu version and species difference another variable was the coverage available. Previously we had 70X starting coverage.With this success I decided to re-run the myrtle rust genome assembly and this time it took a week and the statistics are very similar to the original output.

I did not have to go into the shell scripts to resubmit jobs, as I had to for the first assembly. This time it ran relatively cleanly from the beginning to end.




Canu unitigging/1-overlapper

Using Canu v1.6 this stage has taken 2 months so far with 351 jobs. There are 28 jobs still to run on the University of Sydney HPC, Artemis. Each job requires around 10GB memory, 4 cpus, and a progressive amount of walltime. Starting from only about 5 hours to about 150 hours of walltime for the last 100 jobs.

I have detailed this previously but the genome is predicted to be 1.3 or 1.4 Gbp and we started with 50-60X coverage of PacBio Sequel and RSII data.

The trimmedReads.fasta.gz output was 12G

canu script (NB the genome size was set to 1G for the script):

canu -p MR_1805a \
-d /scratch/directory/run-MR_1805a \
gnuplot=”/usr/local/gnuplot/5.0.0/bin/gnuplot” \
genomeSize=1g \
correctedErrorRate=0.040 \
batOptions=”-dg 3 -db 3 -dr 1 -ca 500 -cp 50″ \

A previous assembly was begun 6 months ago using the same raw data and is still incomplete. The same script was run except for an additional  parameter:


The trimmedReads.fasta.gz output from that assembly run was 17G



Making blast databases from fasta

makeblastdb -dbtype prot -in sequence_name.fasta

This script will return three files:



You can then blastx your raw DNA reads against this database to pull out matches (with a custom script):

/project/directory/blast/ -1 RawReads_R1.fq -2 RawReads_R2.fq -b blastx -d /project/directory/blast/sequence_name.fasta

– before extracting matches based on significance (e-value, length, score). The extracted reads are then used to assemble the gene of interest ‘de novo’.

However, I recently tried running a script for blastx against newly made db files and kept returning the error:

BLAST Database error: No alias or index file found for protein database [/project/directory/sequence.fasta] in search path [/project/directory/::]

I was confused as this has been working up previously. After much wasted time I realised that I had changed my script to include the database name to:

makeblastdb -dbtype prot -in sequence_name.fasta -out sequence_name

This output db files named



NB: The name of the database files did not include the .fasta

Hence blastx could not find my files.


A brief update on resources required for long read assembly

Canu v1.6 resources required for 1.4Gb genome assembly

Currently running an assembly with ~60X coverage based on the RSII and Sequel long read data.

Raw data is first converted to fasta and quiver/arrow using Dextract (Dazzler suite). Quiver is a type of fastq but not sure if it can be used as input to canu or Marvel assemblers. I used the .fasta files for assembly with canu.
Steps in running Canu

  • Submit inital canu script to grid environment  -this step runs Meryl – which runs a kmer analysis, outputs histograms of read length and coverage -also auto submits scripts for batch jobs to the grid for correction stage.
  • After correction, canu auto submits a trimming batch job. This stage takes many resources and time!!!

-for the current job: 917 jobs within the array (these can be done in parrallel). Each job requires ~ 4 CPU, up to 16 gb ram and 60 – 70 hours walltime. The resources are difficult to schedule accurately because the first 10  or so jobs are often not resource hungry and will only take an hour or so. The resources required scale up dramatically and the last two hundred jobs can take as long as 80+ hours each.

  • After trimming, canu autosubmits a unitigging batch job. This stage also takes many resources and time!!! Same resources as correction.
  • Output a fasta assembly and statistics on the assembly.

If all goes well canu will run through all these process without interruption. However, my experience is that the resources need constant monitoring. Also, there is a bug in the shell script that it autosubmits that means if one job in the array fails, it cannot pick up from where is stopped. So individual jobs have to be re-run manually.
So to run the assembly on Canu you ideally need:
(900 X 4)  3600 dedicated CPUs(900 X 16) 14400 GB memory
With these UNREALISTIC resources you could have a genome in walltime: two and a half weeks. Realistically you will need to run jobs within the array in batches according to your resources available. To be clear here, Canu does a good job of the assembly and I applaud the authors of this software!! It is a clearly a very difficult problem to optimise the assembly process with our compute resources. I hope that teams are  working on this very difficult issue or perhaps we have to hope that quantum computing is just around the corner…
The first assemblies I ran with Canu with half the coverage (30X) took aproximately 2 months. These assemblies autosubmitted batch jobs of around 300 for the trimming and unitigging stages.
The following grid options more or less worked for the previous assemblies. The walltime needed some adjustment as the array job progressed.
gridOptionsCORMHAP=”-l walltime=16:00:00″ \

gridOptionsCOROVL=”-l walltime=16:00:00″ \

gridOptionsCOR=”-l walltime=16:00:00″ \

gridOptionsOBTOVL=”-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5″ \

gridOptionsUTGOVL=”-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5″ \

gridOptionsMERYL=”-l walltime=4:00:00″
PS. Marvel assembler is not looking very promising at this stage, although they claim to reduce resource requirement by an order of magnitude by softmasking prior to local alignment step. I may be able to make some adjustments to my script to help but currently, after 336 hours walltime 8 CPUs and 70 gb ram, the Daligner stage has only produced alignments (.las) for 29 out of 363 files.

Moving to Marvel

Well, Canu seems to do a reasonable job of assembling long reads for larger organisms but it is laboriously slow!! Trimming and Unitigging stages can take several weeks each. I am now using the reduced list of reads (after mapping to the mitochondrial genome – which reduced reads by ~4%) with the Marvel assembler.

I have been working through the scripts as per the Schmidtea mediterranea example from this recent Nature publication by Grohme et al 2018.

So far so good. However the claim that by using repeat soft masking will reduce computational time by an order of magnitude is yet to be determined. Fingers crossed.

Using bbmap to extract reads – is this the best approach?

I have a draft mitochondrial genome from my organism. I now have a lot more PacBio data to assemble. I want to extract mitochondrial reads before assembling genomic reads in the hope that this will make the Canu assembly quicker.

Decided to try using bbmap.

#PBS -project name
#PBS -N bbmap
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=24:00:00
#PBS -e ./bbmap_error.txt
#PBS -o ./bbmap_output.txt
#PBS -m abe

BBMAP_OUT_DIR=/scratch/project name/bbmap

module load bbmap
module load java
module load samtools


#bbmap_script in=174152_1.fasta outu=$BBMAP_OUT_DIR/unmapped_174152_1.fasta outm=$BBMAP_OUT_DIR/mapped_174152_1.fasta maxlen=6000

This has a facility to map long reads to your genome but it splits them into ~6000 bases and renames them something like:



It then outputs the mapped and unmapped reads as fasta files. My thoughts were to use the unmapped reads for assembly. However there are two issues.

  1. The parts need to be rejoined.
  2. bbmap seems to bin the whole ‘part’ based on mapping to the genome. ie. if only a portion of the 6000 bases map to the mitochondrial genome it puts it into the mapped bin.

Now wondering if this was the correct approach. There is new software from S. Koren (Canu author) which maps PacBio data.

I had some trepidation using this as it is only newly developed and may be poorly supported, however it is likely to be more suitable.

Further thoughts:

If any part of the read maps to the mitochondrial genome – the read should be derived from the mitochondrial genome. Therefore it is correct that the whole section is binned as mapped. My plan is:

  1. Extract headers from the mapped output.
  2. Use header list to extract reads from original read files.
  3. Use the reads (minus extracted mapped reads) to assemble the genome.


Polishing Canu genome

Having assembled the genome using Canu it is time to polish. This can be done with Pilon, using Illumina reads aligned to the genome. This can also be a time and compute heavy job (days). For example:

PBS -l select=1:ncpus=8:mem=160gb
PBS -l walltime=64:00:00

Now doing the polishing again. This time aligned the raw subread.bam files with the genome using pbalign from SMRT analysis tools. A good explanation was found here after much confusion.

The command to make an xml dataset (of all the subread.bam files for pbalign is this below. Note the subreads.bam files are from the sequel data (one folder) and RSII data (another folder):

~/bin/smrtlink/smrtcmds/bin/dataset create –type SubreadSet –name fungi  fungi_set.xml ../RSII/*.subreads.bam ../sequel/*.subreads.bam

The main confusing bit is the name ‘bam’. Here we have pre-aligned raw bams (subread.bams). Then, after aligning using pbalign, we have aligned bams. These aligned reads are the ones to use in the polishing with arrow from SMRT-tools.



Promising Canu output

After nearly 2 and a half months of tweaking, restarting we have an assembled genome. It came in at 1.37 G assembled bases, 19771 contigs (a bit high I guess but with a coverage of around 21X after trimming and correction not too bad) and an NG50 of 134 701.

Busco analysis results were also not too bad for a first assembly using 1335 conserved taxa genes.

1024 Complete BUSCOs (C)
640 Complete and single-copy BUSCOs (S)
384 Complete and duplicated BUSCOs (D)
101 Fragmented BUSCOs (F)
210 Missing BUSCOs (M)
1335 Total BUSCO groups searched
BUSCO analysis done. Total running time: 5979.202003479004 seconds


I think that the duplicated genes, and maybe the fragmented ones too, is due to heterozygosity. This will have inflated the genome size too. This means that with revised parameters for assembly – to phase the haplotypes – we might get a more realistic assembly.

More work to do but I have heard a few bad comments about Canu and do not think that they are warranted. Time for the assembly was approximately 3 weeks once we ironed out the grid options for different stages.

K-mer analysis and genome size estimation, bridges and Euler

Determining the approximate genome size before you begin expensive PacBio sequencing projects is important.

If Illumina reads are available for the organism, a kmer analysis can estimate the genome size.

There is an excellent tutorial from the University of Connecticut explaining the step by step process:

The seven bridges of Königsberg problem, and Euler’s solution (1736), is the root of the assembly algorithms for short reads. De Bruijn graph theory developed from Euler’s solution that every land mass is a ‘node’ and every bridge is an ‘edge’.

In assembling a genome using de Bruijn graph algorithms, the edges are kmers and the nodes are alignments. Explained very nicely here:

Click to access DeBruijnPrimer.pdf