counts from bam files | bioinformaticsforbeginners

I have been going back into my transcriptomes for each plant and pulling out ‘genes’ that fit with Hidden Markov Models I made using HMMER (http://hmmer.org/). I used the specific domains from resistance gene models, previously identified in Eucalyptus grandis (http://journal.frontiersin.org/article/10.3389/fpls.2015.01238/full) and chitinases (https://academic.oup.com/treephys/article-abstract/37/5/565/3067625/Identification-of-the-Eucalyptus-grandis-chitinase) to initially find putative genes in my clustered Syzygium. Then I used the aligned genes to build species specific nucleotide HMM for several defence-related genes.

Armed with lists of potential genes in each transcriptome I now want to find out if there are actually transcript variations between my resistant and susceptible plants.

This has meant aligning my raw reads for each plant/at each time against its own transcriptome. A big job again – hope it is worth it.

Another thing that has been useful recently is the regular expressions for sorting etc in Notepad +++ (http://www.rexegg.com/regex-quickstart.html). With large datasets it seems there is always something needing extracting or amending.

Have a bunch of primers ready as well to check against all samples. Back to the lab soon for much qRTPCR.

Now with all the bam files, indexed bam files and the bed file made previously we can get read counts per transcript. The moment you are anticipating is not far off. Make sure the bam.bai files are all in the same working directory.

We use BEDtools v2.24.0, with the multiBamCov option.

#PBS -P project
#PBS -N Syzygium_counts
#PBS -l nodes=1:ppn=20
#PBS -l walltime=01:00:00
#PBS -l pmem=24gb
#PBS -e ./Syzygium_counts.txt
#PBS -M email@sydney.edu.au
#PBS -m abe

# Load modules
module load bedtools

# Working directory
cd /working directory location

# Run bedtools script
multiBamCov -q 30 -p -bams AP0_coordSorted.bam AP1_coordSorted.bam AP2_coordSorted.bam -bed Syzygium.BED > Syzygium_counts.txt

The output from this will be a large text file with the numbers of reads that aligned to each transcript.

TRINITY_DN4630_c0_g1_i1 0 2449 68 0 18 30
TRINITY_DN4630_c0_g2_i1 0 457 2 0 4 4 0
TRINITY_DN4679_c0_g1_i1 0 410 2 0 2 0 0
TRINITY_DN4674_c0_g1_i1 0 227 4 0 0 0 0
TRINITY_DN4609_c0_g1_i1 0 265 0 0 0 0 0
TRINITY_DN4651_c0_g1_i1 0 1030 8 0 4 4

which you can put into a spreadsheet and name the columns. And the columns are in the order that the bam files are placed in the script. Of course you would include all the bam files for all samples and all times in the one run using bedtools.

Gene ID	start	end	AP0	AP1	AP2
TRINITY_DN4630_c0_g1_i1	0	2449	68	0	18
TRINITY_DN4630_c0_g2_i1	0	457	2	0	4
TRINITY_DN4679_c0_g1_i1	0	410	2	0	2
TRINITY_DN4674_c0_g1_i1	0	227	4	0	0
TRINITY_DN4609_c0_g1_i1	0	265	0	0	0
TRINITY_DN4651_c0_g1_i1	0	1030	8	0	4

	peritob on Moving to Marvel
	Niel on Moving to Marvel
	peritob on False starts with PacBio reads
	Ric on False starts with PacBio reads
	Yoav on Really great free software

bioinformaticsforbeginners

Category Archives: counts from bam files

Re-working the analysis

Finally to get the counts