Moving to Marvel

Well, Canu seems to do a reasonable job of assembling long reads for larger organisms but it is laboriously slow!! Trimming and Unitigging stages can take several weeks each. I am now using the reduced list of reads (after mapping to the mitochondrial genome – which reduced reads by ~4%) with the Marvel assembler.

https://github.com/schloi/MARVEL

I have been working through the scripts as per the Schmidtea mediterranea example from this recent Nature publication by Grohme et al 2018.

https://www.nature.com/articles/nature25473

So far so good. However the claim that by using repeat soft masking will reduce computational time by an order of magnitude is yet to be determined. Fingers crossed.

Using bbmap to extract reads – is this the best approach?

I have a draft mitochondrial genome from my organism. I now have a lot more PacBio data to assemble. I want to extract mitochondrial reads before assembling genomic reads in the hope that this will make the Canu assembly quicker.

Decided to try using bbmap.

#!/bin/bash
#PBS -project name
#PBS -N bbmap
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=24:00:00
#PBS -e ./bbmap_error.txt
#PBS -o ./bbmap_output.txt
#PBS -M name@sydney.edu.au
#PBS -m abe

BBMAP_OUT_DIR=/scratch/project name/bbmap

module load bbmap
module load java
module load samtools

cd $PBS_O_WORKDIR

#bbmap_script
mapPacBio.sh in=174152_1.fasta outu=$BBMAP_OUT_DIR/unmapped_174152_1.fasta outm=$BBMAP_OUT_DIR/mapped_174152_1.fasta maxlen=6000

This has a facility to map long reads to your genome but it splits them into ~6000 bases and renames them something like:

<read_abc_part_1

<read_abc_part_2

It then outputs the mapped and unmapped reads as fasta files. My thoughts were to use the unmapped reads for assembly. However there are two issues.

  1. The parts need to be rejoined.
  2. bbmap seems to bin the whole ‘part’ based on mapping to the genome. ie. if only a portion of the 6000 bases map to the mitochondrial genome it puts it into the mapped bin.

Now wondering if this was the correct approach. There is new software from S. Koren (Canu author) which maps PacBio data.

https://www.biorxiv.org/content/early/2018/02/05/259986

https://github.com/marbl/MashMap

I had some trepidation using this as it is only newly developed and may be poorly supported, however it is likely to be more suitable.

Further thoughts:

If any part of the read maps to the mitochondrial genome – the read should be derived from the mitochondrial genome. Therefore it is correct that the whole section is binned as mapped. My plan is:

  1. Extract headers from the mapped output.
  2. Use header list to extract reads from original read files.
  3. Use the reads (minus extracted mapped reads) to assemble the genome.