I have a draft mitochondrial genome from my organism. I now have a lot more PacBio data to assemble. I want to extract mitochondrial reads before assembling genomic reads in the hope that this will make the Canu assembly quicker.
Decided to try using bbmap.
#PBS -project name
#PBS -N bbmap
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=24:00:00
#PBS -e ./bbmap_error.txt
#PBS -o ./bbmap_output.txt
#PBS -M firstname.lastname@example.org
#PBS -m abe
module load bbmap
module load java
module load samtools
mapPacBio.sh in=174152_1.fasta outu=$BBMAP_OUT_DIR/unmapped_174152_1.fasta outm=$BBMAP_OUT_DIR/mapped_174152_1.fasta maxlen=6000
This has a facility to map long reads to your genome but it splits them into ~6000 bases and renames them something like:
It then outputs the mapped and unmapped reads as fasta files. My thoughts were to use the unmapped reads for assembly. However there are two issues.
- The parts need to be rejoined.
- bbmap seems to bin the whole ‘part’ based on mapping to the genome. ie. if only a portion of the 6000 bases map to the mitochondrial genome it puts it into the mapped bin.
Now wondering if this was the correct approach. There is new software from S. Koren (Canu author) which maps PacBio data.
I had some trepidation using this as it is only newly developed and may be poorly supported, however it is likely to be more suitable.
If any part of the read maps to the mitochondrial genome – the read should be derived from the mitochondrial genome. Therefore it is correct that the whole section is binned as mapped. My plan is:
- Extract headers from the mapped output.
- Use header list to extract reads from original read files.
- Use the reads (minus extracted mapped reads) to assemble the genome.