Canu unitigging/1-overlapper

Using Canu v1.6, this stage has taken 2 months so far, with 351 jobs; 28 jobs are still to run on the University of Sydney HPC, Artemis. Each job requires around 10 GB memory, 4 CPUs, and a progressively larger amount of walltime, from only about 5 hours at the start to about 150 hours for the last 100 jobs.

I have detailed this previously, but the genome is predicted to be 1.3-1.4 Gbp and we started with 50-60X coverage of PacBio Sequel and RSII data.

The trimmedReads.fasta.gz output was 12 GB.

canu script (NB the genome size was set to 1G for the script):

canu -p MR_1805a \
-d /scratch/directory/run-MR_1805a \
gnuplot="/usr/local/gnuplot/5.0.0/bin/gnuplot" \
genomeSize=1g \
correctedErrorRate=0.040 \
batOptions="-dg 3 -db 3 -dr 1 -ca 500 -cp 50"

A previous assembly was begun 6 months ago using the same raw data and is still incomplete. The same script was run, except for one additional parameter:

corOutCoverage=200

The trimmedReads.fasta.gz output from that assembly run was 17 GB.



Making BLAST databases from fasta

makeblastdb -dbtype prot -in sequence_name.fasta

This command will return three files:

sequence_name.fasta.phr

sequence_name.fasta.pin

sequence_name.fasta.psq

You can then blastx your raw DNA reads against this database to pull out matches (with a custom script):

/project/directory/blast/launch_blast.py -1 RawReads_R1.fq -2 RawReads_R2.fq -b blastx -d /project/directory/blast/sequence_name.fasta

Matches are then extracted based on significance (e-value, length, score), and the extracted reads are used to assemble the gene of interest de novo. A sketch of this filtering step is below.
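As an illustration only (the real filtering lives in the custom script above), here is a minimal sketch of the extraction step, assuming the blastx hits were written in tabular format (-outfmt 6) to a file called blastx_hits.tsv; thresholds and file names are hypothetical:

# Keep hits with e-value <= 1e-10, alignment length >= 100 and bitscore >= 50,
# then collect the unique query read IDs (columns 11, 4, 12 and 1 of outfmt 6)
awk -F'\t' '$11 <= 1e-10 && $4 >= 100 && $12 >= 50 {print $1}' blastx_hits.tsv | sort -u > matching_read_ids.txt

# Pull those reads out of the original fastq files (assumes seqtk is available)
seqtk subseq RawReads_R1.fq matching_read_ids.txt > matched_R1.fq
seqtk subseq RawReads_R2.fq matching_read_ids.txt > matched_R2.fq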

However, I recently tried running a blastx script against newly made db files and it kept returning the error:

BLAST Database error: No alias or index file found for protein database [/project/directory/sequence.fasta] in search path [/project/directory/::]

I was confused, as this had been working previously. After much wasted time I realised that I had changed the makeblastdb command to include an output database name:

makeblastdb -dbtype prot -in sequence_name.fasta -out sequence_name

This outputs db files named:

sequence_name.phr

sequence_name.pin

sequence_name.psq

NB: the database file names no longer include the .fasta extension.

Hence blastx, which was still pointed at sequence_name.fasta, could not find the database files.
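For future reference, a minimal sketch of keeping the two names consistent (query and output names are illustrative):

# Build the database with an explicit name...
makeblastdb -dbtype prot -in sequence_name.fasta -out sequence_name

# ...and point blastx at that same name, not at the .fasta file
blastx -query reads.fasta -db sequence_name -outfmt 6 -evalue 1e-10 -out blastx_hits.tsv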


A brief update on resources required for long read assembly

Canu v1.6 resources required for 1.4Gb genome assembly

Currently running an assembly with ~60X coverage based on the RSII and Sequel long read data.

Raw data is first converted to fasta and quiver/arrow files using dextract (Dazzler suite). The quiver file is a type of fastq, but I am not sure whether it can be used as input to the canu or MARVEL assemblers. I used the .fasta files for assembly with canu.
Steps in running Canu

  • Submit the initial canu script to the grid environment. This step runs Meryl, which performs a k-mer analysis and outputs histograms of read length and coverage; it also auto-submits batch-job scripts to the grid for the correction stage.
  • After correction, canu auto-submits a trimming batch job. This stage takes many resources and time!!!

- For the current job: 917 jobs within the array (these can be run in parallel). Each job requires ~4 CPUs, up to 16 GB RAM and 60-70 hours walltime. The resources are difficult to schedule accurately because the first 10 or so jobs are often not resource hungry and will only take an hour or so; the resources required then scale up dramatically, and the last two hundred jobs can take as long as 80+ hours each.

  • After trimming, canu auto-submits a unitigging batch job. This stage also takes many resources and time!!! Same resources as correction.
  • Finally, canu outputs a fasta assembly and statistics on the assembly.

If all goes well canu will run through all these processes without interruption. However, my experience is that the resources need constant monitoring. Also, there is a bug in the shell script that it auto-submits which means that if one job in the array fails, it cannot pick up from where it stopped, so individual jobs have to be re-run manually (see the sketch below).
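For what it is worth, a sketch of what re-running a single failed array job can look like, assuming the usual canu directory layout and a PBS scheduler; the job number and walltime are illustrative:

# Move into the stage that failed (here the unitigging overlapper)
cd /scratch/directory/run-MR_1805a/unitigging/1-overlapper

# canu's generated script accepts the array index as an argument,
# so a single failed job (e.g. number 412) can be re-run by hand...
./overlap.sh 412

# ...or resubmitted to the scheduler as a one-element array job
qsub -J 412-412 -l walltime=150:00:00 overlap.sh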
So to run the assembly on Canu you ideally need:
(900 x 4) = 3600 dedicated CPUs
(900 x 16) = 14400 GB memory
With these UNREALISTIC resources you could have a genome in about two and a half weeks of walltime. Realistically, you will need to run jobs within the array in batches according to the resources available to you. To be clear, Canu does a good job of the assembly and I applaud the authors of this software!! It is clearly a very difficult problem to optimise the assembly process for the compute resources we have. I hope that teams are working on this very difficult issue, or perhaps we have to hope that quantum computing is just around the corner…
The first assemblies I ran with Canu, with half the coverage (30X), took approximately 2 months. These assemblies auto-submitted batch jobs of around 300 for the trimming and unitigging stages.
The following grid options more or less worked for the previous assemblies. The walltime needed some adjustment as the array job progressed.
gridOptionsCORMHAP="-l walltime=16:00:00" \

gridOptionsCOROVL="-l walltime=16:00:00" \

gridOptionsCOR="-l walltime=16:00:00" \

gridOptionsOBTOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \

gridOptionsUTGOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \

gridOptionsMERYL="-l walltime=4:00:00"
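These gridOptions* settings are passed to canu as ordinary key=value options, i.e. they slot into the same invocation shown earlier, for example:

canu -p MR_1805a \
-d /scratch/directory/run-MR_1805a \
genomeSize=1g \
correctedErrorRate=0.040 \
gridOptionsOBTOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \
gridOptionsUTGOVL="-l walltime=56:00:00 -l mem=16g -l nodes=1:ppn=5" \
gridOptionsMERYL="-l walltime=4:00:00"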
PS. The MARVEL assembler is not looking very promising at this stage, although the authors claim to reduce resource requirements by an order of magnitude by soft-masking repeats prior to the local alignment step. I may be able to make some adjustments to my script to help, but currently, after 336 hours of walltime on 8 CPUs and 70 GB RAM, the daligner stage has only produced alignments (.las) for 29 out of 363 files.

Moving to Marvel

Well, Canu seems to do a reasonable job of assembling long reads for larger organisms, but it is laboriously slow!! The trimming and unitigging stages can take several weeks each. I am now using the reduced list of reads (after mapping to the mitochondrial genome, which reduced the reads by ~4%) with the MARVEL assembler.

https://github.com/schloi/MARVEL

I have been working through the scripts as per the Schmidtea mediterranea example from the recent Nature publication by Grohme et al. (2018).

https://www.nature.com/articles/nature25473

So far so good. However, whether repeat soft-masking really reduces computational time by an order of magnitude, as claimed, remains to be seen. Fingers crossed.

Using bbmap to extract reads – is this the best approach?

I have a draft mitochondrial genome from my organism. I now have a lot more PacBio data to assemble. I want to extract mitochondrial reads before assembling genomic reads in the hope that this will make the Canu assembly quicker.

Decided to try using bbmap.

#!/bin/bash
#PBS -P project_name
#PBS -N bbmap
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=24:00:00
#PBS -e ./bbmap_error.txt
#PBS -o ./bbmap_output.txt
#PBS -M name@sydney.edu.au
#PBS -m abe

BBMAP_OUT_DIR=/scratch/project_name/bbmap

module load bbmap
module load java
module load samtools

cd $PBS_O_WORKDIR

# bbmap script: map the PacBio reads against the mitochondrial reference
# (assumes the reference index already exists in ./ref; otherwise add ref=<mito_genome.fasta> to the command)
mapPacBio.sh in=174152_1.fasta outu=$BBMAP_OUT_DIR/unmapped_174152_1.fasta outm=$BBMAP_OUT_DIR/mapped_174152_1.fasta maxlen=6000

bbmap has a facility to map long reads to your genome, but it splits them into ~6000 base parts and renames them something like:

>read_abc_part_1

>read_abc_part_2

It then outputs the mapped and unmapped reads as fasta files. My thought was to use the unmapped reads for assembly. However, there are two issues:

  1. The parts need to be rejoined.
  2. bbmap seems to bin each whole ‘part’ based on mapping to the genome, i.e. if only a portion of the 6000 bases maps to the mitochondrial genome, the whole part is put into the mapped bin.

Now I am wondering if this was the correct approach. There is new software, MashMap, from S. Koren (a Canu author) which maps PacBio data.

https://www.biorxiv.org/content/early/2018/02/05/259986

https://github.com/marbl/MashMap

I had some trepidation about using this as it is newly developed and may be less well supported; however, it is likely to be more suitable.
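Untested here, but a minimal sketch of what a MashMap run might look like for this use case (flags as per the MashMap README; file names are illustrative):

# Map the PacBio reads against the draft mitochondrial genome
mashmap -r draft_mito.fasta -q 174152_1.fasta -o mito_mappings.out -t 4 -s 5000 --pi 85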

Further thoughts:

If any part of a read maps to the mitochondrial genome, then the read should be derived from the mitochondrial genome, so it is correct that the whole part is binned as mapped. My plan (a sketch follows the list below) is:

  1. Extract headers from the mapped output.
  2. Use header list to extract reads from original read files.
  3. Use the reads (minus extracted mapped reads) to assemble the genome.
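A rough sketch of those three steps, assuming seqkit is available on the cluster; the file names, and the _part_N suffix that bbmap appears to add, follow the example above:

# 1. Extract headers from the bbmap 'mapped' output and strip the _part_N suffix
#    to recover the original read names
grep '^>' mapped_174152_1.fasta | sed -e 's/^>//' -e 's/_part_[0-9]*$//' | sort -u > mito_read_ids.txt

# 2 & 3. Use the header list to remove those reads from the original read file;
#        the remaining reads go forward to the genome assembly
seqkit grep -v -f mito_read_ids.txt 174152_1.fasta > 174152_1.no_mito.fasta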


Polishing Canu genome

Having assembled the genome using Canu, it is time to polish. This can be done with Pilon, using Illumina reads aligned to the genome. This can also be a time- and compute-heavy job (days). For example:

#PBS -l select=1:ncpus=8:mem=160gb
#PBS -l walltime=64:00:00
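Not from my scripts, but a hedged sketch of the Pilon round itself, assuming the Illumina reads are aligned with bwa mem and sorted/indexed with samtools (file names and memory settings are illustrative):

# Align the Illumina reads to the Canu assembly, then sort and index the bam
bwa index canu_assembly.fasta
bwa mem -t 8 canu_assembly.fasta illumina_R1.fq.gz illumina_R2.fq.gz | samtools sort -@ 8 -o illumina_vs_assembly.bam
samtools index illumina_vs_assembly.bam

# Polish the assembly with Pilon using the paired-end alignments
java -Xmx150G -jar pilon.jar --genome canu_assembly.fasta --frags illumina_vs_assembly.bam --output polished_assembly --changes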

Now I am doing the polishing again. This time I aligned the raw subreads.bam files to the genome using pbalign from the SMRT analysis tools. A good explanation was found here after much confusion:

https://www.biostars.org/p/273447/

The command to make an XML dataset (of all the subreads.bam files) for pbalign is below. Note that the subreads.bam files are from the Sequel data (one folder) and the RSII data (another folder):

~/bin/smrtlink/smrtcmds/bin/dataset create --type SubreadSet --name fungi fungi_set.xml ../RSII/*.subreads.bam ../sequel/*.subreads.bam

The main confusing bit is the name ‘bam’. Here we start with unaligned raw bams (subreads.bam); then, after aligning using pbalign, we have aligned bams. These aligned bams are the ones to use for polishing with arrow from SMRT Tools (a sketch of these two steps follows the download link below).

http://www.pacb.com/support/software-downloads/
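A hedged sketch of the alignment and polishing steps, assuming the SMRT Tools binaries are on the PATH and using the dataset created above (other file names are illustrative):

# Align the raw subreads (via the XML dataset) to the Canu assembly
pbalign --nproc 8 fungi_set.xml canu_assembly.fasta aligned_subreads.bam

# arrow needs a PacBio index (.pbi) for the aligned bam and a samtools faidx index for the reference
pbindex aligned_subreads.bam
samtools faidx canu_assembly.fasta

# Polish the assembly with arrow (GenomicConsensus)
arrow -j 8 aligned_subreads.bam -r canu_assembly.fasta -o polished_assembly.fasta -o polished_assembly.fastq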