
Later versions of Canu assembler are WAY better.

After struggling with Canu version 1.6 to assemble a highly repetitive and large (1 Gbp) fungal genome in 2018 (doi: https://doi.org/10.1101/2020.03.18.996108), using PBS-Pro grid options, I was expecting a similar process when I began a new fungal assembly recently.

I decided to test the latest version (2.0) with an 800 Mbp genome and 200X PacBio coverage. Using the same script that was used to assemble a dihaploid genome for myrtle rust (which took 4.5 months), this latest assembly took just over a week! Very excited to have such a quick turnaround.

The output looks promising, with good BUSCO results of over 94% complete gene models. Apart from the Canu version and the species difference, another variable was the coverage available; previously we had 70X starting coverage. With this success I decided to re-run the myrtle rust genome assembly, and this time it took a week and the statistics are very similar to the original output.
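
For context, a minimal sketch of a BUSCO completeness check like the one mentioned above, assuming BUSCO v4+ and a basidiomycete lineage set (the assembly name, lineage and thread count are placeholders, not the exact ones used here):

busco -i canu_assembly.fasta -l basidiomycota_odb10 -m genome -o busco_canu -c 8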

I did not have to go into the shell scripts to resubmit jobs, as I had to for the first assembly. This time it ran relatively cleanly from beginning to end.

Canu unitigging/1-overlapper

Using Canu v1.6, this stage has taken 2 months so far, with 351 jobs. There are 28 jobs still to run on the University of Sydney HPC, Artemis. Each job requires around 10 GB memory, 4 CPUs, and a progressively increasing amount of walltime, from only about 5 hours up to about 150 hours for the last 100 jobs.
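
For reference, the per-job resource requests for the later overlap jobs would look roughly like the following PBS-Pro directives (a sketch based on the figures above, not the scripts Canu generates itself):

#PBS -l select=1:ncpus=4:mem=10GB
#PBS -l walltime=150:00:00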

I have detailed this previously, but the genome is predicted to be 1.3 or 1.4 Gbp and we started with 50-60X coverage of PacBio Sequel and RSII data.

The trimmedReads.fasta.gz output was 12G

canu script (NB the genome size was set to 1G for the script):

canu -p MR_1805a \
-d /scratch/directory/run-MR_1805a \
gnuplot="/usr/local/gnuplot/5.0.0/bin/gnuplot" \
genomeSize=1g \
correctedErrorRate=0.040 \
batOptions="-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \
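
The input-reads argument is not shown in the snippet above; with Canu 1.x it would typically follow on the continued line as something like the below (the path here is a placeholder, not the one actually used):

-pacbio-raw /path/to/pacbio_reads/*.fastq.gz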

A previous assembly was begun 6 months ago using the same raw data and is still incomplete. The same script was run, except for one additional parameter:

corOutCoverage=200

The trimmedReads.fasta.gz output from that assembly run was 17G.

Making BLAST databases from fasta

makeblastdb -dbtype prot -in sequence_name.fasta

This command will return three files:

sequence_name.fasta.phr

sequence_name.fasta.pin

sequence_name.fasta.psq

You can then blastx your raw DNA reads against this database to pull out matches (with a custom script):

/project/directory/blast/launch_blast.py -1 RawReads_R1.fq -2 RawReads_R2.fq -b blastx -d /project/directory/blast/sequence_name.fasta

Matches are then extracted based on significance (e-value, length, score), and the extracted reads are used to assemble the gene of interest de novo.
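
As a rough illustration of that extraction step (the launch_blast.py script above handles this internally; the thresholds and filenames below are placeholders), tabular blastx output can be filtered on e-value, alignment length and score:

# blastx with tabular output, filtered to an e-value threshold
blastx -query raw_reads.fasta -db sequence_name.fasta -outfmt 6 -evalue 1e-5 -out hits.tsv
# keep hits with alignment length >= 100 and bit score >= 50 (columns 4 and 12) and collect the read IDs
awk '$4 >= 100 && $12 >= 50 {print $1}' hits.tsv | sort -u > matching_read_ids.txt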

However, I recently tried running a script for blastx against newly made db files and kept getting the error:

BLAST Database error: No alias or index file found for protein database [/project/directory/sequence.fasta] in search path [/project/directory/::]

I was confused, as this had been working previously. After much wasted time I realised that I had changed my script to name the database with the -out option:

makeblastdb -dbtype prot -in sequence_name.fasta -out sequence_name

This outputs db files named:

sequence_name.phr

sequence_name.pin

sequence_name.psq

NB: The names of the database files no longer include the .fasta extension.

Hence blastx, which was still pointed at sequence_name.fasta, could not find the database files.
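
In short, whatever name is given to -out (or, if -out is omitted, the full input filename) is what blastx needs as -db. A minimal sketch with consistent naming (filenames are placeholders):

makeblastdb -dbtype prot -in sequence_name.fasta -out sequence_name
blastx -query raw_reads.fasta -db sequence_name -outfmt 6 -out hits.tsv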


Moving to Marvel

Well, Canu seems to do a reasonable job of assembling long reads for larger organisms, but it is laboriously slow! The Trimming and Unitigging stages can take several weeks each. I am now using the reduced set of reads (after mapping to the mitochondrial genome, which reduced reads by ~4%) with the MARVEL assembler.

https://github.com/schloi/MARVEL

I have been working through the scripts as per the Schmidtea mediterranea example from this recent Nature publication by Grohme et al. 2018.

https://www.nature.com/articles/nature25473

So far so good. However, whether repeat soft masking really reduces computational time by an order of magnitude, as claimed, is yet to be determined. Fingers crossed.

Using bbmap to extract reads – is this the best approach?

I have a draft mitochondrial genome from my organism. I now have a lot more PacBio data to assemble. I want to extract mitochondrial reads before assembling genomic reads in the hope that this will make the Canu assembly quicker.

Decided to try using bbmap.

#!/bin/bash
#PBS -P project_name
#PBS -N bbmap
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=24:00:00
#PBS -e ./bbmap_error.txt
#PBS -o ./bbmap_output.txt
#PBS -M name@sydney.edu.au
#PBS -m abe

BBMAP_OUT_DIR=/scratch/project_name/bbmap

module load bbmap
module load java
module load samtools

cd $PBS_O_WORKDIR

#bbmap_script (ref= is assumed here to point at the draft mitochondrial genome; the filename is a placeholder)
mapPacBio.sh ref=mito_draft.fasta in=174152_1.fasta outu=$BBMAP_OUT_DIR/unmapped_174152_1.fasta outm=$BBMAP_OUT_DIR/mapped_174152_1.fasta maxlen=6000

mapPacBio.sh can map long reads to a reference, but it splits them into ~6000-base parts and renames them something like:

>read_abc_part_1

>read_abc_part_2

It then outputs the mapped and unmapped reads as fasta files. My thoughts were to use the unmapped reads for assembly. However there are two issues.

  1. The parts need to be rejoined.
  2. bbmap seems to bin the whole 'part' based on mapping to the genome, i.e. if only a portion of the 6000 bases maps to the mitochondrial genome, the whole part goes into the mapped bin.

Now wondering if this was the correct approach. There is newer software from S. Koren (a Canu author) which maps PacBio data:

https://www.biorxiv.org/content/early/2018/02/05/259986

https://github.com/marbl/MashMap

I had some trepidation about using this, as it is newly developed and may be poorly supported; however, it is likely to be more suitable.

Further thoughts:

If any part of a read maps to the mitochondrial genome, the read should be derived from the mitochondrial genome, so it is correct that the whole part is binned as mapped. My plan (a rough sketch follows the list) is:

  1. Extract headers from the mapped output.
  2. Use header list to extract reads from original read files.
  3. Use the reads (minus extracted mapped reads) to assemble the genome.
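
A rough sketch of steps 1-3, assuming the bbmap outputs above and BBTools' filterbyname.sh (the filenames, and the _part_ suffix handling, follow the naming shown earlier and may need adjusting):

# 1. extract headers from the mapped output and strip the _part_N suffix to recover original read names
grep '^>' $BBMAP_OUT_DIR/mapped_174152_1.fasta | sed -e 's/^>//' -e 's/_part_[0-9]*$//' | sort -u > mapped_read_names.txt
# 2-3. remove those reads from the original file; include=f keeps only reads NOT in the list
filterbyname.sh in=174152_1.fasta out=174152_1.no_mito.fasta names=mapped_read_names.txt include=f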


Polishing Canu genome

Having assembled the genome using Canu, it is time to polish. This can be done with Pilon, using Illumina reads aligned to the genome. This can also be a time- and compute-heavy job (days). For example:

#PBS -l select=1:ncpus=8:mem=160gb
#PBS -l walltime=64:00:00
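
For reference, a minimal sketch of the alignment and Pilon steps, assuming paired-end Illumina reads (the filenames, Pilon jar path and memory setting are placeholders):

# align Illumina reads to the Canu assembly, then sort and index the bam
bwa index canu_assembly.fasta
bwa mem -t 8 canu_assembly.fasta illumina_R1.fq.gz illumina_R2.fq.gz | samtools sort -@ 8 -o illumina_sorted.bam
samtools index illumina_sorted.bam
# run Pilon on the assembly using the aligned Illumina reads
java -Xmx150G -jar pilon.jar --genome canu_assembly.fasta --frags illumina_sorted.bam --output pilon_round1 --changes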

Now doing the polishing again. This time I aligned the raw subreads.bam files to the genome using pbalign from the SMRT analysis tools. A good explanation, found after much confusion, is here:

https://www.biostars.org/p/273447/

The command to make an XML dataset (of all the subreads.bam files) for pbalign is below. Note that the subreads.bam files are from the Sequel data (one folder) and the RSII data (another folder):

~/bin/smrtlink/smrtcmds/bin/dataset create --type SubreadSet --name fungi fungi_set.xml ../RSII/*.subreads.bam ../sequel/*.subreads.bam

The main confusing bit is the name 'bam'. Here we start with raw, pre-alignment bams (subreads.bam). Then, after aligning using pbalign, we have aligned bams. These aligned reads are the ones to use in the polishing with arrow from SMRT-tools.
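
For reference, a minimal sketch of the align-then-polish steps with the SMRT tools (filenames and thread counts are placeholders):

# align the raw subreads (the XML dataset above) to the Canu assembly
~/bin/smrtlink/smrtcmds/bin/pbalign --nproc 8 fungi_set.xml canu_assembly.fasta aligned_subreads.bam
# index the reference, then call the polished consensus with arrow
samtools faidx canu_assembly.fasta
~/bin/smrtlink/smrtcmds/bin/arrow -j 8 aligned_subreads.bam -r canu_assembly.fasta -o polished.fasta -o polished.gff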

http://www.pacb.com/support/software-downloads/