Category Archives: alignment

Using bbmap to extract reads – is this the best approach?

I have a draft mitochondrial genome from my organism. I now have a lot more PacBio data to assemble. I want to extract mitochondrial reads before assembling genomic reads in the hope that this will make the Canu assembly quicker.

Decided to try using bbmap.

#!/bin/bash
#PBS -P project_name
#PBS -N bbmap
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=24:00:00
#PBS -e ./bbmap_error.txt
#PBS -o ./bbmap_output.txt
#PBS -M name@sydney.edu.au
#PBS -m abe

BBMAP_OUT_DIR=/scratch/project_name/bbmap

module load bbmap
module load java
module load samtools

cd $PBS_O_WORKDIR

#bbmap_script. ref= (the draft mitochondrial genome; file name assumed here) is required
mapPacBio.sh ref=mito_draft.fasta in=174152_1.fasta outu=$BBMAP_OUT_DIR/unmapped_174152_1.fasta outm=$BBMAP_OUT_DIR/mapped_174152_1.fasta maxlen=6000

mapPacBio.sh can map long reads to a reference, but it splits them into chunks of up to ~6,000 bases (the maxlen setting) and renames them something like:

>read_abc_part_1

>read_abc_part_2

It then outputs the mapped and unmapped reads as fasta files. My thought was to use the unmapped reads for assembly. However, there are two issues:

  1. The parts need to be rejoined.
  2. bbmap seems to bin the whole ‘part’ based on mapping to the genome, i.e. if only a portion of the 6,000 bases maps to the mitochondrial genome, the whole part goes into the mapped bin.
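On the first issue, rejoining the parts could be done with a small awk sketch like the one below. This is a toy demo with made-up read names, and it assumes the parts of each read appear consecutively and in order in the fasta file; check your own output before trusting it.

```shell
# Toy example: two reads that were split into parts (names invented).
cat > parts_demo.fasta <<'EOF'
>read_abc_part_1
AAAATTTT
>read_abc_part_2
CCCCGGGG
>read_xyz_part_1
GGGGAAAA
EOF

# Strip the "_part_N" suffix and concatenate consecutive parts that
# share a base name back into a single record.
awk '
  /^>/ {
    name = $1
    sub(/_part_[0-9]+$/, "", name)   # remove the _part_N suffix
    if (name != current) {           # new read: emit its header once
      if (current != "") printf "\n"
      print name
      current = name
    }
    next
  }
  { printf "%s", $0 }                # append sequence lines unbroken
  END { if (current != "") printf "\n" }
' parts_demo.fasta > rejoined_demo.fasta
```

The same command would run on the real mapped/unmapped fasta files in place of `parts_demo.fasta`.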

I am now wondering whether this was the correct approach. There is new software, MashMap, from S. Koren (a Canu author), which maps PacBio data.

https://www.biorxiv.org/content/early/2018/02/05/259986

https://github.com/marbl/MashMap

I had some trepidation about using it, as it is newly developed and may be less well supported; however, it is likely to be more suitable.

Further thoughts:

If any part of a read maps to the mitochondrial genome, the read should derive from the mitochondrial genome, so it is correct that the whole part is binned as mapped. My plan is:

  1. Extract headers from the mapped output.
  2. Use header list to extract reads from original read files.
  3. Use the reads (minus extracted mapped reads) to assemble the genome.
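A minimal sketch of those steps, run here on toy data (the real input names would come from the bbmap run above). The `_part_N` suffix is stripped so each original read is listed only once before filtering.

```shell
# Toy stand-ins for the real mapped bin and the original read file.
cat > mapped_demo.fasta <<'EOF'
>read_mito_part_1
ACGT
>read_mito_part_2
TTTT
EOF
cat > all_reads_demo.fasta <<'EOF'
>read_mito
ACGTTTTT
>read_nuc
GGGGCCCC
EOF

# 1. Extract headers from the mapped output, dropping ">" and "_part_N".
grep '^>' mapped_demo.fasta \
  | sed -e 's/^>//' -e 's/_part_[0-9]*$//' \
  | sort -u > mito_ids.txt

# 2./3. Keep only reads whose name is NOT in the list; assemble these.
awk 'NR==FNR { bad[$1]; next }
     /^>/    { keep = !(substr($1, 2) in bad) }
     keep' mito_ids.txt all_reads_demo.fasta > nuclear_reads.fasta
```

For real data sizes, BBMap's own filterbyname.sh (with names=mito_ids.txt and include=f) can do the exclusion step instead of the awk filter.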

 

Finally to get the counts

Now, with all the bam files, indexed bam files and the bed file made previously, we can get read counts per transcript. The moment you have been anticipating is not far off. Make sure the bam.bai files are all in the same working directory as the bam files.

We use BEDtools v2.24.0, with the multiBamCov option.

#PBS -P project
#PBS -N Syzygium_counts
#PBS -l nodes=1:ppn=20
#PBS -l walltime=01:00:00
#PBS -l pmem=24gb
#PBS -e ./Syzygium_counts_error.txt
#PBS -M email@sydney.edu.au
#PBS -m abe

# Load modules
module load bedtools

# Working directory
cd /path/to/working_directory

# Run bedtools script
multiBamCov -q 30 -p -bams AP0_coordSorted.bam AP1_coordSorted.bam AP2_coordSorted.bam -bed Syzygium.BED > Syzygium_counts.txt

The output from this will be a large text file with the numbers of reads that aligned to each transcript.

TRINITY_DN4630_c0_g1_i1 0 2449 68 0 18
TRINITY_DN4630_c0_g2_i1 0 457 2 0 4
TRINITY_DN4679_c0_g1_i1 0 410 2 0 2
TRINITY_DN4674_c0_g1_i1 0 227 4 0 0
TRINITY_DN4609_c0_g1_i1 0 265 0 0 0
TRINITY_DN4651_c0_g1_i1 0 1030 8 0 4

You can put this into a spreadsheet and name the columns; the count columns appear in the same order as the bam files are listed in the script. Of course, you would include the bam files for all samples and all time points in the one bedtools run.

Gene_ID start end AP0 AP1 AP2
TRINITY_DN4630_c0_g1_i1 0 2449 68 0 18
TRINITY_DN4630_c0_g2_i1 0 457 2 0 4
TRINITY_DN4679_c0_g1_i1 0 410 2 0 2
TRINITY_DN4674_c0_g1_i1 0 227 4 0 0
TRINITY_DN4609_c0_g1_i1 0 265 0 0 0
TRINITY_DN4651_c0_g1_i1 0 1030 8 0 4
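If you would rather skip the spreadsheet, the header row can be prepended on the command line. A small sketch, using a made-up demo file in place of Syzygium_counts.txt:

```shell
# Demo counts file standing in for the real multiBamCov output.
cat > counts_demo.txt <<'EOF'
TRINITY_DN4630_c0_g1_i1	0	2449	68	0	18
TRINITY_DN4630_c0_g2_i1	0	457	2	0	4
EOF

# Prepend a header row; the count column names must follow the order
# the bam files were given to multiBamCov.
{ printf 'Gene_ID\tstart\tend\tAP0\tAP1\tAP2\n'; cat counts_demo.txt; } \
  > counts_labelled.txt
```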

Aligning the reads to the transcriptome

Armed with the clustered fasta file and the trimmed raw reads you can now determine the read hits per transcript.

Alignment is supported within the Trinity platform, which wraps these third-party tools provided they are installed. The required software includes Bowtie (I used v1.1.2) and SAMtools (v1.2), which produce the bam file outputs. Perl is also required, as of course is Trinity itself, if running its bundled utilities.

#PBS -P project_name
#PBS -N AP0_alignment
#PBS -l nodes=1:ppn=20
#PBS -l walltime=20:00:00
#PBS -l pmem=24gb
#PBS -e ./AP0_alignment.txt
#PBS -M email@sydney.edu.au
#PBS -m abe

# Load modules
module load perl
module load bowtie
module load java
module load samtools/1.2
module load trinity/2.1.1

# Working directory
cd /path/to/working_directory

# Run bowtie script
/usr/local/trinity/2.1.1/util/bowtie_PE_separate_then_join.pl --seqType fq \
--left AP0_R1_pairedwithunpaired.trim.fq --right AP0_R2_pairedwithunpaired.trim.fq \
--target Syzygium.fasta --aligner bowtie \
-- -p 4 --all --best --strata -m 300

This outputs a folder called bowtie containing bam files and indexed bams. For the next step you will use the bam files from all samples and all time points to determine the counts per transcript. I rename the bam files according to plant/time point, e.g. AP0_coordSorted.bam and AP0_coordSorted.bam.bai.
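The renaming can be scripted per sample. This is a sketch only: the file names inside the bowtie folder are guessed here (check with `ls bowtie/` first), and the demo below fakes them with empty files.

```shell
# Demo setup: empty files standing in for the real bowtie output
# (the actual names inside bowtie/ may differ; verify before running).
mkdir -p bowtie
touch bowtie/bowtie_out.coordSorted.bam bowtie/bowtie_out.coordSorted.bam.bai

# Rename according to plant/time point so the counting step can tell
# the samples apart.
SAMPLE=AP0
mv bowtie/bowtie_out.coordSorted.bam     ${SAMPLE}_coordSorted.bam
mv bowtie/bowtie_out.coordSorted.bam.bai ${SAMPLE}_coordSorted.bam.bai
```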