I have a draft mitochondrial genome from my organism. I now have a lot more PacBio data to assemble. I want to extract mitochondrial reads before assembling genomic reads in the hope that this will make the Canu assembly quicker.
I decided to try BBMap, using the following PBS script:
#!/bin/bash
#PBS -project name
#PBS -N bbmap
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=24:00:00
#PBS -e ./bbmap_error.txt
#PBS -o ./bbmap_output.txt
#PBS -M name@sydney.edu.au
#PBS -m abe
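# Quote the path so a space in the project name does not break the assignment.
BBMAP_OUT_DIR="/scratch/project name/bbmap"
module load bbmap
module load java
module load samtools
cd "$PBS_O_WORKDIR"
# Map reads with BBMap's PacBio mode; reads longer than maxlen are split into parts.
# No ref= is given here: the mitochondrial reference must be passed as ref=<fasta>
# unless an index was already built in ./ref.
mapPacBio.sh in=174152_1.fasta outu="$BBMAP_OUT_DIR/unmapped_174152_1.fasta" outm="$BBMAP_OUT_DIR/mapped_174152_1.fasta" maxlen=6000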
mapPacBio.sh can map long reads to a reference, but it splits them into parts of at most ~6000 bases (the maxlen=6000 setting) and renames the parts something like:
>read_abc_part_1
>read_abc_part_2
It then outputs the mapped and unmapped parts as FASTA files. My plan was to use the unmapped reads for assembly. However, there are two issues:
- The parts need to be rejoined.
- bbmap seems to bin the whole ‘part’ based on mapping to the genome, i.e. if only a portion of the 6000 bases maps to the mitochondrial genome, the whole part goes into the mapped bin.
I am now wondering whether this was the correct approach. There is new software, MashMap, from S. Koren (Canu author) which maps PacBio data:
https://www.biorxiv.org/content/early/2018/02/05/259986
https://github.com/marbl/MashMap
I have some trepidation about using it, as it is newly developed and may be poorly supported; however, it is likely to be more suitable.
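For comparison, a MashMap run against the mitochondrial draft might look roughly like the sketch below. I have not run this: the file names mito_draft.fasta and mito_hits.out are hypothetical, the flags are as I understand them from the MashMap README and should be checked against the installed version, and the helper mashmap_query_ids is my own name. It assumes the output is space-delimited with the query read name in the first column.

```shell
#!/bin/bash
# Hypothetical invocation (file names are placeholders, not from my actual run):
#   mashmap -r mito_draft.fasta -q 174152_1.fasta -o mito_hits.out
#
# Assumed output layout: one mapping per line, space-delimited,
# with the query (read) name in column 1.

# Print each read name that has at least one mapping, once.
mashmap_query_ids() {
    awk '{ print $1 }' "$1" | sort -u
}
```

The resulting list of read names could then be used in the same way as a list of headers taken from the bbmap mapped output.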
Further thoughts:
If any part of a read maps to the mitochondrial genome, the read is presumably derived from the mitochondrial genome. It is therefore correct that the whole part is binned as mapped. My plan is:
- Extract headers from the mapped output and strip the part suffix to recover the original read names.
- Use header list to extract reads from original read files.
- Use the reads (minus extracted mapped reads) to assemble the genome.
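The plan above could be sketched roughly as follows. This is a minimal sketch, not something I have tested on the real data. It assumes that the part names end in _part_<N> (as in the example headers), that the read ID is the first whitespace-delimited token of each header, and that the function names mapped_read_ids and drop_listed_reads are my own. The output names mito_ids.txt and nonmito_174152_1.fasta are hypothetical.

```shell
#!/bin/bash
# Sketch: drop every read that had at least one mapped part.
# Assumptions: part headers look like ">read_abc_part_1", and the read ID
# is the first whitespace-delimited token of a FASTA header.

# Print the unique parent read IDs found in a FASTA file of mapped parts.
mapped_read_ids() {
    grep '^>' "$1" \
        | sed -e 's/^>//' -e 's/[[:space:]].*$//' -e 's/_part_[0-9][0-9]*$//' \
        | sort -u
}

# Print every record of FASTA file $2 whose read ID is NOT listed in file $1.
drop_listed_reads() {
    awk '
        FILENAME == ARGV[1] { drop[$1] = 1; next }   # first file: IDs to drop
        /^>/ { id = substr($1, 2); keep = !(id in drop) }
        keep                                         # print lines of kept records
    ' "$1" "$2"
}

# Example usage (output file names are placeholders):
#   mapped_read_ids "$BBMAP_OUT_DIR/mapped_174152_1.fasta" > mito_ids.txt
#   drop_listed_reads mito_ids.txt 174152_1.fasta > nonmito_174152_1.fasta
```

Filtering the original read file rather than the unmapped output means the parts never need to be rejoined, since the reads that go to Canu are the untouched originals.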