Category Archives: Trinity software

Aligning the reads to the transcriptome

Armed with the clustered fasta file and the trimmed raw reads you can now determine the read hits per transcript.

Alignment functions are supported within the Trinity platform, thereby incorporating the use of these third party tools, providing they are installed. Installed software required include; Bowtie (I used v1.1.2) and SAMtools (v1.2) to give outputs of bam files. Also required is Perl and of course Trinity – if running within the streamlined utilities.

#PBS -P project name
#PBS -N AP0_alignment
#PBS -l nodes=1:ppn=20
#PBS -l walltime=20:00:00
#PBS -l pmem=24gb
#PBS -e ./AP0_alignment.txt
#PBS -M email@sydney.edu.au
#PBS -m abe

# Load modules
module load perl
module load bowtie
module load java
module load samtools/1.2
module load trinity/2.1.1

# Working directory
cd /path to working directory

# Run bowtie script
/usr/local/trinity/2.1.1/util/bowtie_PE_separate_then_join.pl –seqType fq \
–left AP0_R1_pairedwithunpaired.trim.fq –right AP0_R2_pairedwithunpaired.trim.fq \
–target Syzygium.fasta –aligner bowtie \
— -p 4 –all –best –strata -m 300

From this will be output a folder called bowtie containing bam files and indexed bams. For the next step you will use the bam files from all samples and all times to determine the counts per transcript.I rename the bam files according the plant/time eg. AP0_coordSorted.bam and AP0_coordSorted.bam.bai

Transcriptome output

Since the last post I have re-run the assembly with both paired and unpaired trimmed sequences. My new transcriptome is not very different in size but I am yet to determine if there are other variations for downstream analysis.

To obtain some statistics of the trinity output you can use the N50 check.

#Run N50 script
/usr/local/trinity/2.1.1/util/TrinityStats.pl AP.fasta

with an output like this:

################################
## Counts of transcripts, etc.
################################
Total trinity ‘genes’: 67231
Total trinity transcripts: 82911
Percent GC: 45.44

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3700
Contig N20: 2863
Contig N30: 2350
Contig N40: 1979
Contig N50: 1654

Median contig length: 585
Average contig: 983.42
Total assembled bases: 81535957
#####################################################
## Stats based on ONLY LONGEST ISOFORM per ‘GENE’:
#####################################################

Contig N10: 3463
Contig N20: 2612
Contig N30: 2127
Contig N40: 1758
Contig N50: 1408

Median contig length: 476
Average contig: 834.77
Total assembled bases: 56122482

Compared to previous assembly:

################################
## Counts of transcripts, etc.
################################
Total trinity ‘genes’: 67123
Total trinity transcripts: 82776
Percent GC: 45.44

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3687
Contig N20: 2864
Contig N30: 2350
Contig N40: 1978
Contig N50: 1652

Median contig length: 586
Average contig: 983.62
Total assembled bases: 81419959
#####################################################
## Stats based on ONLY LONGEST ISOFORM per ‘GENE’:
#####################################################

Contig N10: 3456
Contig N20: 2612
Contig N30: 2126
Contig N40: 1757
Contig N50: 1407

Median contig length: 476
Average contig: 834.73
Total assembled bases: 56029878

 

 

Building the transcriptome

The purpose of many RNAseq studies is to compare the gene expression patterns across different treatments. In order to do this, as I mentioned in the previous post, all the reads from each treatment (from the one plant) need to be combined first.

Trinity assembly software incorporates other tools as well as assembly, for example Trimmomatic can be run within the trinity pipeline. To do the trimming and assembly you would run a script like this (note no spaces or line breaks):

Trinity –seqType fq –max_memory 24G –CPU 7 –left AP0_R1.fastq,AP1_R1.fastq,AP2_R1.fastq –right AP0_R2.fastq,AP1_R2.fastq,AP2_R2.fastq–trimmomatic –output AP_trinity

This will build all the reads into a complete transcriptome for the plant labelled AP from RNA expressed pre-inoculation, at 24 hours and at 48 hours.

Depending on the success of the sequencing, ie. the number and length of raw reads, the assembly process will take around 1 to 3 days of wall-time. Below is the pbs script to run for trimmed reads.

#PBS -P (project name)
#PBS -N AP_assembly
#PBS -l nodes=1:ppn=24
#PBS -l walltime=100:00:00
#PBS -l pmem=24gb
#PBS -e ./AP_trinity/AP_assembly.txt
#PBS -M email@sydney.edu.au
#PBS -m abe

# Load modules
module load bowtie
module load java
module load trinity/2.1.1

# Working directory
cd /project/full path name to your working directory (where all the input files are)

# Run trinity
Trinity –seqType fq –max_memory 24G –CPU 24 –left AP0_R1.trimpaired.fastq,AP1_R 1.trimpaired.fastq,AP2_R1.trimpaired.fastq –right AP0_R2.trimpaired.fastq,AP1_R2.trimpaired.fastq,AP2_R2.trimpaired.fastq –output AP_trinity

NB: As I said previously, I am learning on the process as I go. Actually writing this bog is making me find the mistakes I have made along the way. Just found an error in the way I ran the above trinity script. It appears that the paired and unpaired trimmed reads are combined with the trinity assembly. I have only used my paired reads. At this point I am way down the pipeline so this is a big blow! I have decided to continue with my current pipeline as well as re-run the assemblies with both the paired and un-paired reads and then compare the outcomes.

Why do RNAseq?

There is always a scientific question at the basis of doing RNAseq. Here are some images of susceptible Myrtaceae plants infected with myrtle rust. Clockwise from top left: Syzygium luehmannii (Riberry), Syzygium leuhmannii, Syzygium jambos, Chamelaucium uncinatum (Geraldton wax) flower buds, Chamelaucium uncinatum stems and leaves. For highly susceptible plants the pathogen can be devastating. Determining the genetic basis for resistance would therefore be very useful for plant breeding and for population and species management. Whole gene expression studies are one way to determine this genetic basis by identifying the changes in resistant versus susceptible plants.

myrtle rust

 

Trinity assembly – and how not to go about it

I have used the Trinity assembly software to build denovo transcriptomes for each of my plants. Lots of good information here:

https://github.com/trinityrnaseq/trinityrnaseq/wiki

I made several errors to begin with. The first error was to build transcriptomes for each plant at each time-point thinking that this would allow me to compare across the treatments. Running the assembly software can take days to complete – so this has meant a lot of time wasted doing 24 separate assemblies!

The correct thing to do is to combine the data from all times (pre-inoculation, 24 hours post-inoculation and 48 hours post-inoculation) for each plant. In my case I did 8 assemblies. You need all the gene transcripts from all samples to build the most complete ‘potential’ expression profile. You can then use this transcriptome fasta file to map the reads from each treatment later on. The transcriptome is also really useful for other analysis you might want to do later such as searching for specific gene families etc.

more on trinity assembly to come later…