Category Archives: RNAseq

Things to do better next time

I have had some errors found in my assembled transcriptomes when they were submitted to NCBI. It appears that I missed a couple of quality control steps at the trimming stage.

What I have learnt:

  • after using the default trimmomatic settings – check the data before proceeding.
  • incorporate a file with all Illumina primers for identifying and trimming these from reads.
  • use bbduk to help with this process – as other residual primers may be there – from previous runs on the machine (
  • screen the raw data for contaminants, eg. bacterial, human and other non-target sequences. Seems my transcriptomes included a  few sequences from humans, gorilla and chimp!!
  • None of this really affected my analysis and, in fact, was such a small proportion of the assembled transcriptome BUT – the assemblies are now a bit questionable and should be done again.

Re-working the analysis

I have been going back into my transcriptomes for each plant and pulling out ‘genes’ that fit with Hidden Markov Models I made using HMMER ( I used the specific domains from resistance gene models, previously identified in Eucalyptus grandis ( and chitinases ( to initially find putative genes in my clustered Syzygium. Then I used the aligned genes to build species specific nucleotide HMM for several defence-related genes.

Armed with lists of potential genes in each transcriptome I now want to find out if there are actually transcript variations between my resistant and susceptible plants.

This has meant aligning my raw reads for each plant/at each time against its own transcriptome. A big job again – hope it is worth it.

Another thing that has been useful recently is the regular expressions for sorting etc in Notepad +++ ( With large datasets it seems there is always something needing extracting or amending.

Have a bunch of primers ready as well to check against all samples. Back to the lab soon for much qRTPCR.


Working on visually presenting the results

I have been busy writing my thesis so no posts for a while. Interspersed with writing I have been finding ways to present the data. There are many ways to visualize the differential expression such as heatmaps and volcano plots. Both do-able in R, but I am now looking into some python modules that look promising.

Anyway here are volcano plots (using R) from DE expression between 0 and 48 hours post inoculation  (A) susceptible and (B) resistant plants. Each dot represents a transcript. Red =  FDR <0.01, Green = FDR<0.01 and l log2 fold change >4 , Yellow and labelled a-j= FDR< 1.00E-07 and log2 fold change >4 . Gene homologues labelled are; a = zinc finger protein, b,d = carboxylesterase 12, c = Puccinia psidii ITS, e = zinc finger CCCH domain-containing protein,  f= myosin heavy chain kinase, uncharacterized, g = metalloendoproteinase, h= strigolactone esterase.


Finally to get the counts

Now with all the bam files, indexed bam files and the bed file made previously we can get read counts per transcript. The moment you are anticipating is not far off. Make sure the bam.bai files are all in the same working directory.

We use BEDtools v2.24.0, with the multiBamCov option.

#PBS -P project
#PBS -N Syzygium_counts
#PBS -l nodes=1:ppn=20
#PBS -l walltime=01:00:00
#PBS -l pmem=24gb
#PBS -e ./Syzygium_counts.txt
#PBS -m abe

# Load modules
module load bedtools

# Working directory
cd /working directory location

# Run bedtools script
multiBamCov -q 30 -p -bams AP0_coordSorted.bam AP1_coordSorted.bam AP2_coordSorted.bam -bed Syzygium.BED > Syzygium_counts.txt

The output from this will be a large text file with the numbers of reads that aligned to each transcript.

TRINITY_DN4630_c0_g1_i1 0 2449 68 0 18 30
TRINITY_DN4630_c0_g2_i1 0 457 2 0 4 4 0
TRINITY_DN4679_c0_g1_i1 0 410 2 0 2 0 0
TRINITY_DN4674_c0_g1_i1 0 227 4 0 0 0 0
TRINITY_DN4609_c0_g1_i1 0 265 0 0 0 0 0
TRINITY_DN4651_c0_g1_i1 0 1030 8 0 4 4

which you can put into a spreadsheet and name the columns. And the columns are in the order that the bam files are placed in the script. Of course you would include all the bam files for all samples and all times in the one run using bedtools.

Gene ID start end AP0 AP1 AP2
TRINITY_DN4630_c0_g1_i1 0 2449 68 0 18
TRINITY_DN4630_c0_g2_i1 0 457 2 0 4
TRINITY_DN4679_c0_g1_i1 0 410 2 0 2
TRINITY_DN4674_c0_g1_i1 0 227 4 0 0
TRINITY_DN4609_c0_g1_i1 0 265 0 0 0
TRINITY_DN4651_c0_g1_i1 0 1030 8 0 4

Aligning the reads to the transcriptome

Armed with the clustered fasta file and the trimmed raw reads you can now determine the read hits per transcript.

Alignment functions are supported within the Trinity platform, thereby incorporating the use of these third party tools, providing they are installed. Installed software required include; Bowtie (I used v1.1.2) and SAMtools (v1.2) to give outputs of bam files. Also required is Perl and of course Trinity – if running within the streamlined utilities.

#PBS -P project name
#PBS -N AP0_alignment
#PBS -l nodes=1:ppn=20
#PBS -l walltime=20:00:00
#PBS -l pmem=24gb
#PBS -e ./AP0_alignment.txt
#PBS -m abe

# Load modules
module load perl
module load bowtie
module load java
module load samtools/1.2
module load trinity/2.1.1

# Working directory
cd /path to working directory

# Run bowtie script
/usr/local/trinity/2.1.1/util/ –seqType fq \
–left AP0_R1_pairedwithunpaired.trim.fq –right AP0_R2_pairedwithunpaired.trim.fq \
–target Syzygium.fasta –aligner bowtie \
— -p 4 –all –best –strata -m 300

From this will be output a folder called bowtie containing bam files and indexed bams. For the next step you will use the bam files from all samples and all times to determine the counts per transcript.I rename the bam files according the plant/time eg. AP0_coordSorted.bam and AP0_coordSorted.bam.bai

Running biopython scripts

As I am learning a bit about scripts and how to run them I have made recent errors which probably appear very simple to someone from a computing background.

The word I have learnt is ‘parse‘ which means:

  • to analyze (a sentence) into its parts and describe their syntactic roles.
  • analyze (a string or text) into logical syntactic components, typically in order to test conformability to a logical grammar.
  • examine or analyze minutely.

Another word is to ‘pass‘. You pass some instruction to the software to parse (analyze).

To run software you need:

  • the software and dependencies (other software) to be installed
  • script to tell the program what you want it to do
  • input files

I managed to find this very useful biopython script to make bed files from my newly made transcriptomes. Bed files are tab  delimited files that include the sequence ID, and the start and end location. For new transcriptomes the start is always 0 and the end is the length of the contig. These files can be used to gather the count data from your alignment files per sequence ID (gene).

In case the website above is unavailable, here is the biopython script, slightly altered for the later version of Biopython which requires brackets around the print function:

import sys
from Bio import SeqIO

fasta_handle = open(sys.argv[1], "rU") #Open the fasta file specified by the user on the command line

#Go through all the records in the fasta file
for record in SeqIO.parse(fasta_handle, "fasta"):
    cur_id = #Name of the current feature
    cur_length = len(record.seq) #Size of the current feature
    print (("%s\t%d\t%d") % (cur_id, 0, cur_length)) #Output the name, start, and end coordinates to the screen

This script will take any fasta file you give it and output the bed file. The important thing is to tell python.exe what to do with the input file. You need to pass the script to python.exe so it can parse and act on it. Therefore the order of your command must be correct – the .py script comes directly after the python.exe.

I placed the input file first and could not understand why I kept getting the error that my fasta file had syntax errors.

So in the command line window – I just run this in Windows cmd.exe by holding shift and right-clicking on the folder I want as my working directory;

Open the command window in the directory where your fasta file is, then input the directory path to Python34\python.exe then input the directory path to the script and output to your working directory.

So it will look something like this:

D:\Syzygium\>C:Python34\python.exe D:\scripts\ Syzygium.fasta > Syzygium.bed

The output will look like this showing the length of each ‘gene’.

TRINITY_DN4638_c0_g1_i1 0 486
TRINITY_DN4638_c0_g2_i1 0 486
TRINITY_DN4608_c0_g1_i1 0 442
TRINITY_DN4687_c0_g1_i1 0 207
TRINITY_DN4630_c0_g1_i1 0 2449
TRINITY_DN4630_c0_g2_i1 0 457
TRINITY_DN4679_c0_g1_i1 0 410
TRINITY_DN4674_c0_g1_i1 0 227
TRINITY_DN4609_c0_g1_i1 0 265
TRINITY_DN4651_c0_g1_i1 0 1030
TRINITY_DN4626_c0_g1_i1 0 468
TRINITY_DN4626_c1_g1_i1 0 286
TRINITY_DN4653_c0_g2_i1 0 555
TRINITY_DN4653_c1_g1_i1 0 793
TRINITY_DN4648_c0_g1_i1 0 605
TRINITY_DN4632_c0_g1_i1 0 2026
TRINITY_DN4672_c0_g1_i1 0 239


Clustering the transcriptomes

I have now made eight transcriptomes. Four from the resistant plants and four from the susceptible. To be able to compare expression from different times, and between the different biological replicates, I have to make a single transcriptome for the plant species, in this case, Syzygium (Lilly Pilly). This is a bit concerning as my plants are grown from wild sourced seeds and are likely to be highly heterozygous.

My approach therefore was to use the CD-HIT-EST-2D ( software to combine transcriptomes, initially at 0.95  similarity. I clustered the transcriptomes in pairs from the largest to the smallest until I had a single fasta file.

#PBS -P (project)
#PBS -l nodes=1:ppn=20
#PBS -l walltime=00:15:00
#PBS -l pmem=24gb
#PBS -e ./SYZ1.error
#PBS -M (email address)
#PBS -m abe

# Load modules
module load cd-hit/4.6.1

# Working directory
Input path to directory

# Run CD-Hit script
cd-hit-est-2d -i BU.fasta -i2 BS.fasta -o SYZ1.fasta -c 0.95 -n 10 -d 0 -M 0 -T 0

I then sought out highly conserved genes using local BLAST in BioEdit. I used a couple of Eucalyptus grandis chitinase genes that I know to be well conserved from previous work, currently under review.

I found that the genes were present and intact in the individual transcriptomes but not in the combined one. I therefore ran the process with CD-HIT again with a higher stringency of 0.98.

This time the genes were present in the final transcriptome making me feel more confident about the alignment process.

Why do RNAseq?

There is always a scientific question at the basis of doing RNAseq. Here are some images of susceptible Myrtaceae plants infected with myrtle rust. Clockwise from top left: Syzygium luehmannii (Riberry), Syzygium leuhmannii, Syzygium jambos, Chamelaucium uncinatum (Geraldton wax) flower buds, Chamelaucium uncinatum stems and leaves. For highly susceptible plants the pathogen can be devastating. Determining the genetic basis for resistance would therefore be very useful for plant breeding and for population and species management. Whole gene expression studies are one way to determine this genetic basis by identifying the changes in resistant versus susceptible plants.

myrtle rust


Trinity assembly – and how not to go about it

I have used the Trinity assembly software to build denovo transcriptomes for each of my plants. Lots of good information here:

I made several errors to begin with. The first error was to build transcriptomes for each plant at each time-point thinking that this would allow me to compare across the treatments. Running the assembly software can take days to complete – so this has meant a lot of time wasted doing 24 separate assemblies!

The correct thing to do is to combine the data from all times (pre-inoculation, 24 hours post-inoculation and 48 hours post-inoculation) for each plant. In my case I did 8 assemblies. You need all the gene transcripts from all samples to build the most complete ‘potential’ expression profile. You can then use this transcriptome fasta file to map the reads from each treatment later on. The transcriptome is also really useful for other analysis you might want to do later such as searching for specific gene families etc.

more on trinity assembly to come later…


Trimming the reads

Trimming the reads

The raw sequence reads still have Illumina adaptors and some of them might be low quality. The software often used for trimming reads is Trimmomatic (I used trimmomatic v0.33). Manual available here:

Click to access TrimmomaticManual_V0.32.pdf

Trimmomatic requires Java to be installed and loaded, the default java version on Artemis (hpc) is currently  1.8.0 and this works fine. I did not change any default settings for Trimmomatic.

Here is the pbs script for my paired end data

Note: the term ‘module’ means software and there is generally a default version that is uploaded for your work unless you specify the version in the Load Modules section of the pbs),

#PBS -P (project directory)
#PBS -N (job name eg. AP0 trim)
#PBS -l nodes=1:ppn=16
#PBS -l walltime=01:00:00
#PBS -l pmem=4gb
#PBS -m abe

#Load modules
module load java
module load trimmomatic

# Working directory
cd /(pathname of working directory where files are located)

# Run trimmomatic
java -jar /usr/local/trimmomatic/0.33/trimmomatic-0.33.jar PE -phred33 AP0_R1.fastq AP0_R2.fastq AP0_R1_trimpaired.fq AP0_R1_trimunpaired.fq AP0_R2_trimpaired.fq AP0_R2_trimunpaired.fq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

So the input files are:

AP0_R1.fastq and AP0_R2.fastq (or gzipped versions)

and the output files are:

AP0_R1_trimpaired.fq, AP0_R1_trimunpaired.fq, AP0_R2_trimpaired.fq and AP0_R2_trimunpaired.fq (or gzipped versions)

These output files can be used for the assembly process with Trinity software.


Starting out with your raw reads

You have your data – now what?

Your data will usually come to you from the sequencing facility as masses of short reads in zipped files. The files will be labelled something like this:





So for this sample AP0 – (plant labeled AP and from pre-inoculation therefore ‘0’), I have paired-end data. R1 is from the forward read and R2 is from the reverse read. The Illumina run was done using two lanes therefore I have two R1 and two R2 datasets (L001 and L002).

These zipped files are .fastq files. Fasta files are the basic text files that gene sequences are formatted in with headers beginning with a  ‘>’ followed by the sequence on the next line. If they are not too massive they can be opened in Notepad+++. Fastq files have additional information about the quality of the read which cannot be easily understood by humans but can allow software to sort through and discard poor reads. More information here,

You want all the sample AP0 R1 reads from two sequencing lanes to be in one file. The same goes for the R2 reads. So you will need to unzip the files and then combine them explained below (in Back to the zipped raw data).

Unix and high performance computing

Most of the work involved in RNAseq analysis requires a Unix environment (command line) rather than Windows or graphic user interfaces. To run assembly and alignment software you also really need high performance computing (hpc) access. There are a number of ways to access these such as:

but as I have access to the university cluster I will be just discussing the methods I have used for Artemis, the University of Sydney hpc.

On my local Windows operating system I have installed PuTTY software ( which allows me to log in to the hpc remotely using a VPN and operate the software with my data in a project directory allocated to me. I use Filezilla ( for transferring much of my data from my local data storage location to the hpc and back.

PBS scripts

When using hpc clusters all jobs need to be scheduled and queued so that appropriate allocation of resources occurs. To run software on your data you submit a portable batch script (PBS) which specifies the software you need to use, the resources you need and the time expected to run the job (wall-time). It is often a bit of guesswork to know how long a job will take but the manuals for software, always available on-line, will give some guidance. Here is an example of information required in a PBS script for Artemis:

# Some useful brief comment on what this script does
#PBS -P RDS-FAC-PROJECT-RW – make sure to replace FAC and PROJECT to suit you
#PBS -N Job_101  -Job name – this is what you will see when you run commands to check the status of your job
#PBS -l nodes=1:ppn=2 – This is requesting 1 node and 2 processor per node. Only request more if your job multi-threads. 1 node has a max of 24 processors, so say you wanted 40 cores you would specify ‘nodes=2:ppn=20’
#PBS -l walltime=02:00:00 – Maximum run time of your job, in hours:minutes:seconds. Example requests 2 hours. Job is killed when it reaches wall time so make sure you request enough.
#PBS -l pmem=4gb – RAM per processor
#PBS -j oe – Send the ‘standard error’ and ‘standard output’ log files to one file (delete this line to keep them separate)
#PBS -M – email job reports to this email address
#PBS -m abe – along with the above directive, this asks for an email when the job aborts (a), begins (b), and ends (e). ‘e’ is the most important as it will give you a report of resources used which can help you decide how much resources to request for the next run.

# Load modules
module load trinity – job will use the software trinity
module load trimmomatic – job will use the software trimmomatic

# Working directory: – if all your input and output is in the same directory, if you don’t have all the data in the same directory it can be useful to have this ‘cd’ command in the script -so you can run the script from any directory without specifying full pathnames for the input and output files.

# Run trinity: – insert commands to run the job…

I will provide some of the actual pbs scripts I have used as I go.

Back to the zipped raw data

Once you have your data on the hpc you can unzip using the unix command ‘gunzip’ as these are gzipped files (gz):

gunzip ‘filename’

eg. gunzip AP0_CGATGT_L001_R1_001.fastq.gz

Your original .gz file will be gone from the directory and replaced with the unzipped version.

With your files unzipped you can now join them with the unix command cat and, in my example, rename the new combined file AP0_R1.fastq:

cat AP0_CGATGT_L001_R1_001.fastq AP0_CGATGT_L002_R1_001.fastq > AP0_R1.fastq

When you have all your data combined like this you can zip them again and begin processing them. Much of the software for assembly can process zipped files. To zip them:

gzip ‘filename’

Next post – trimming the reads