Category Archives: Assembly

A brief update on resources required for long read assembly

Canu v1.6 resources required for 1.4Gb genome assembly

Currently running an assembly with ~60X coverage based on the RSII and Sequel long read data.

Raw data is first converted to fasta plus quiver/arrow files using dextract (Dazzler suite). The quiver/arrow file is a type of fastq, but I am not sure whether it can be used as input to the Canu or Marvel assemblers, so I used the .fasta files for assembly with Canu.
Steps in running Canu

  • Submit the initial Canu script to the grid environment. This step runs Meryl, which performs a k-mer analysis and outputs histograms of read length and coverage; it also auto-submits batch-job scripts to the grid for the correction stage.
  • After correction, Canu auto-submits a trimming batch job. This stage takes many resources and much time!

For the current job there are 917 jobs in the array (these can be run in parallel). Each job requires ~4 CPUs, up to 16 GB RAM and 60–70 hours walltime. The resources are difficult to schedule accurately because the first 10 or so jobs are often not resource-hungry and take only an hour or so; the requirements then scale up dramatically, and the last two hundred jobs can take 80+ hours each.

  • After trimming, Canu auto-submits a unitigging batch job. This stage also takes many resources and much time! Same resources as correction.
  • Finally, Canu outputs a fasta assembly plus statistics on the assembly.

If all goes well Canu will run through all these processes without interruption. However, my experience is that the resources need constant monitoring. There is also a bug in the auto-submitted shell script: if one job in the array fails, Canu cannot pick up from where it stopped, so individual jobs have to be re-run manually.
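To find which array jobs need re-running, I use a sketch like the one below. It assumes the zero-padded output naming used by Canu's overlap jobs on my system; the directory layout and the .ovb extension may differ on yours, so check your own trimming/1-overlapper directory first.

```shell
# List array indices whose overlap output is missing so they can be
# resubmitted by hand. Directory layout and the .ovb extension are
# assumptions -- verify against your own assembly directory.
check_missing() {
    dir=$1; total=$2
    for i in $(seq 1 "$total"); do
        f=$(printf "%s/%06d.ovb" "$dir" "$i")   # e.g. 000042.ovb
        [ -e "$f" ] || echo "$i"
    done
}
# Usage: check_missing trimming/1-overlapper 917
```

The printed indices can then be fed back to the scheduler one at a time.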
So to run the assembly on Canu you ideally need:
(900 × 4) = 3600 dedicated CPUs and (900 × 16) = 14,400 GB of memory.
With these UNREALISTIC resources you could have a genome in a walltime of about two and a half weeks. Realistically you will need to run the jobs within the array in batches according to the resources available to you. To be clear, Canu does a good job of the assembly and I applaud the authors of this software! It is clearly a very difficult problem to optimise the assembly process for the compute resources we have. I hope that teams are working on this very difficult issue, or perhaps we have to hope that quantum computing is just around the corner…
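For planning, a rough back-of-envelope estimate of elapsed time under a fixed CPU budget can be scripted. The 917-job, ~4 CPU, ~70 hour figures are from the trimming stage above; the 400-CPU budget is just an illustrative assumption, and every job is assumed to need the worst-case walltime.

```shell
# Rough elapsed-time estimate when only a fixed number of array jobs
# can run at once (worst-case walltime assumed for every job).
estimate_days() {
    jobs=$1; cpus_per_job=$2; hours_per_job=$3; cpus_available=$4
    concurrent=$(( cpus_available / cpus_per_job ))
    waves=$(( (jobs + concurrent - 1) / concurrent ))   # ceil(jobs/concurrent)
    echo $(( waves * hours_per_job / 24 ))
}
estimate_days 917 4 70 400   # a 400-CPU budget: 10 waves, prints 29 (days)
```

This is for one stage only; correction and unitigging add their own waves on top.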
The first assemblies I ran with Canu, with half the coverage (30X), took approximately two months. Those assemblies auto-submitted batch jobs of around 300 for the trimming and unitigging stages.
The following grid options more or less worked for the previous assemblies. The walltime needed some adjustment as the array job progressed.
gridOptionsCORMHAP="-l walltime=16:00:00" \

gridOptionsCOROVL="-l walltime=16:00:00" \

gridOptionsCOR="-l walltime=16:00:00" \

gridOptionsOBTOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \

gridOptionsUTGOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \

gridOptionsMERYL="-l walltime=4:00:00"
PS. The Marvel assembler is not looking very promising at this stage, although the authors claim to reduce resource requirements by an order of magnitude by soft-masking prior to the local alignment step. I may be able to make some adjustments to my script to help, but currently, after 336 hours of walltime with 8 CPUs and 70 GB RAM, the daligner stage has produced alignments (.las) for only 29 out of 363 files.

Further canu developments

Tweaking the memory and wall-time for the compute-cluster Mhap job (the correction and overlapping algorithm) progressed things slightly, but the job still failed inexplicably.

After many trials it appears that the project directory simply did not have the disk capacity for the very large (several terabytes) intermediate files created by Mhap. It seems a pity that neither the HPC nor the software alerted us to this issue, hence many wasted attempts and several weeks of trying.

We then ran the job in the scratch directory, which has around 200 TB of disk space available, and the Mhap job finally completed successfully and the next stage began. This is the other correction and trimming algorithm in Canu, called Ovl (it returns alignments). We are now trying to work out why this stage only partially completes. We have upped the memory for this stage from the default 5 GB to 10 GB and it is running again, so I guess we will see in a few days if it worked.
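Given how the disk-space failure bit us, a pre-flight check before submitting is worth a few lines. This is a minimal sketch; the ~5 TB threshold in the usage comment is our guess based on this genome, not a Canu-documented requirement.

```shell
# Fail early if the run directory's filesystem lacks the head-room that
# Mhap's intermediate files needed (several TB for our 1.4 Gb genome).
check_space() {
    dir=$1; need_kb=$2
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    if [ "$avail_kb" -lt "$need_kb" ]; then
        echo "only ${avail_kb} kB free in $dir" >&2
        return 1
    fi
    echo ok
}
# e.g. require ~5 TB on scratch before submitting the canu job:
# check_space /scratch/directory 5000000000 && qsub canu.pbs
```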

#PBS -P project directory
#PBS -N canu
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=00:30:00
#PBS -e ./canu_error.txt
#PBS -o ./canu_output.txt
#PBS -m abe

# Load modules
module load gnuplot
module load perl
module load java
module load canu

# Working directory

canu -p name -d /scratch/directory/run-1510 genomeSize=1000m gridOptions="-P project directory" gridOptionsCORMHAP="-l mem=60g -l walltime=16:00:00" gridOptionsCOROVL="-l mem=10g" gnuplot="/usr/local/gnuplot/5.0.0/bin/gnuplot" -pacbio-raw ../RSII/*.fasta ../sequel/*.fasta

False starts with PacBio reads

As mentioned previously, the raw reads from the PacBio RSII and Sequel machines come as bax.h5 and subreads.bam respectively, while both the Falcon and Canu assemblers take Fasta or Fastq input files.

After reading a bit more we realised that we needed to extract the fasta files for assembly as well as the arrow files for polishing. dextract, from the Dazzler suite, is the software that extracts and makes these files from the bax.h5 and subreads.bam.

The following is what happened when we attempted to install dextract via FALCON-integrate on the HPC.

We found DEXTRACTOR was not on the HPC. We thought it might have been installed as part of FALCON, which is there, but this was not the case. We then decided to install FALCON-integrate, which includes DEXTRACTOR. We followed the instructions here:
export GIT_SYM_CACHE_DIR=~/.git-sym-cache # to speed things up
cd FALCON-integrate
git checkout master  # or whatever version you want
git submodule update --init # Note: You must do this yourself! No longer via `make init`.
make init
make config-edit-user
make -j all
make test  # to run a simple one
We needed to run "module load python" to get the correct python version (2.7.9).
We found that dextract did not build. This was fixed by changing the hdf5 include and lib paths in the makefile (DEXTRACTOR/Makefile).
We then found that FALCON-integrate doesn't include the later version of DEXTRACTOR that can work with .bam files, so we decided to do a stand-alone install of the latest DEXTRACTOR.
git checkout master
make -f Makefile
The Makefile needed to be edited to work with different zlib and hdf5 include and lib paths
(we possibly could have avoided this by running "module load hdf5/1.8.16" and "module load zlib" beforehand).
Running dextract
You need to run:
module load hdf5/1.8.16
module load zlib
~/bin/DEXTRACTOR/dextract -f -a m54078_170627_071131.subreads.bam
This generates a m54078_170627_071131.fasta and a m54078_170627_071131.arrow in the current directory.
-f outputs a fasta file
-a outputs an arrow file
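To convert a whole Sequel run rather than one file at a time, a loop like the one below is a reasonable sketch. The sequel/ directory, dextract path and module names follow the examples above but are site-specific; the helper function just predicts the output names dextract writes into the current directory.

```shell
# Predict the .fasta/.arrow names dextract writes for a given subreads.bam
outputs_for() {
    base=${1##*/}                  # strip the directory part
    base=${base%.subreads.bam}     # strip the extension
    echo "${base}.fasta ${base}.arrow"
}

# Convert every Sequel cell in ./sequel (dextract writes to the cwd).
# Module names/paths are from our HPC -- adjust for yours.
if command -v module >/dev/null 2>&1; then
    module load hdf5/1.8.16
    module load zlib
fi
for bam in sequel/*.subreads.bam; do
    [ -e "$bam" ] || continue      # skip if the glob matched nothing
    ~/bin/DEXTRACTOR/dextract -f -a "$bam"
done
```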

Starting the process of genome assembly with long reads

I have struggled to find a clean pipeline for de novo assembly of long-read data from large eukaryotes. I have reads from PacBio RS II as well as Sequel; RS II read files are .h5 and Sequel files are .bam. An explanation of the bam format is here; it is not just an alignment format:

At this stage I am planning to convert the .h5 files to .bam and then use Falcon to assemble (hoping the coverage is enough), then map short reads (Illumina) to the contigs, perhaps with Pilon. Early days, so we will see what eventuates along the way. I will post scripts as I go.


Things to do better next time

Some errors were found in my assembled transcriptomes when they were submitted to NCBI. It appears that I missed a couple of quality-control steps at the trimming stage.

What I have learnt:

  • after using the default Trimmomatic settings, check the data before proceeding.
  • incorporate a file with all Illumina primers for identifying and trimming these from the reads.
  • use bbduk to help with this process, as other residual primers may be present from previous runs on the machine.
  • screen the raw data for contaminants, e.g. bacterial, human and other non-target sequences. It seems my transcriptomes included a few sequences from humans, gorilla and chimp!!
  • none of this really affected my analysis, and in fact it was a very small proportion of the assembled transcriptome, BUT the assemblies are now a bit questionable and should be done again.
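As a concrete example of the bbduk step, something like the command below is what I would try next time. The adapters.fa file name and the k-mer parameter values are illustrative assumptions, not settings from the original run; check the BBDuk documentation against your own library prep.

```shell
# Sketch: adapter/primer trimming of paired-end Illumina reads with
# bbduk (BBTools). File names and parameter values are assumptions.
bbduk.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz \
         out1=clean_R1.fq.gz out2=clean_R2.fq.gz \
         ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo \
         stats=bbduk_stats.txt
```

The stats file then gives a per-adapter breakdown, which is exactly the check-before-proceeding step I skipped.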