Category Archives: Transcriptomes

Aligning the reads to the transcriptome

Armed with the clustered fasta file and the trimmed raw reads you can now determine the read hits per transcript.

Alignment functions are supported within the Trinity platform, which wraps these third-party tools provided they are installed. The required software includes Bowtie (I used v1.1.2) and SAMtools (v1.2), which produce bam file outputs. Perl is also required, and of course Trinity itself if you are running its streamlined utilities.

#PBS -P (project name)
#PBS -N AP0_alignment
#PBS -l nodes=1:ppn=20
#PBS -l walltime=20:00:00
#PBS -l pmem=24gb
#PBS -e ./AP0_alignment.txt
#PBS -M email@sydney.edu.au
#PBS -m abe

# Load modules
module load perl
module load bowtie
module load java
module load samtools/1.2
module load trinity/2.1.1

# Working directory
cd /path/to/working/directory

# Run bowtie script
/usr/local/trinity/2.1.1/util/bowtie_PE_separate_then_join.pl --seqType fq \
--left AP0_R1_pairedwithunpaired.trim.fq --right AP0_R2_pairedwithunpaired.trim.fq \
--target Syzygium.fasta --aligner bowtie \
-- -p 4 --all --best --strata -m 300

This outputs a folder called bowtie containing bam files and indexed bams. For the next step you will use the bam files from all samples and all times to determine the counts per transcript. I rename the bam files according to the plant/time, e.g. AP0_coordSorted.bam and AP0_coordSorted.bam.bai.
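The renaming itself is just a couple of mv commands per sample. A sketch below; note that the inner file names, and the mkdir/touch lines standing in for the real bowtie output, are placeholders for illustration only:

```shell
# Placeholder 'bowtie' output folder and files (illustration only;
# the real files come from the bowtie alignment step above)
mkdir -p bowtie
touch bowtie/bowtie_out.coordSorted.bam bowtie/bowtie_out.coordSorted.bam.bai

# Rename according to plant/time, e.g. sample AP0
mv bowtie/bowtie_out.coordSorted.bam     AP0_coordSorted.bam
mv bowtie/bowtie_out.coordSorted.bam.bai AP0_coordSorted.bam.bai
```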

Clustering the transcriptomes

I have now made eight transcriptomes: four from the resistant plants and four from the susceptible. To be able to compare expression from different times, and between the different biological replicates, I have to make a single transcriptome for the plant species, in this case, Syzygium (Lilly Pilly). This is a bit concerning as my plants are grown from wild-sourced seeds and are likely to be highly heterozygous.

My approach therefore was to use the CD-HIT-EST-2D (http://weizhongli-lab.org/cd-hit/) software to combine transcriptomes, initially at 0.95 similarity. I clustered the transcriptomes in pairs from the largest to the smallest until I had a single fasta file.

#PBS -P (project)
#PBS -N SYZ1_CD-HIT
#PBS -l nodes=1:ppn=20
#PBS -l walltime=00:15:00
#PBS -l pmem=24gb
#PBS -e ./SYZ1.error
#PBS -M (email address)
#PBS -m abe

# Load modules
module load cd-hit/4.6.1

# Working directory
cd /path/to/working/directory

# Run CD-HIT script
cd-hit-est-2d -i BU.fasta -i2 BS.fasta -o SYZ1.fasta -c 0.95 -n 10 -d 0 -M 0 -T 0

I then sought out highly conserved genes using local BLAST in BioEdit. I used a couple of Eucalyptus grandis chitinase genes that I know to be well conserved from previous work, currently under review.

I found that the genes were present and intact in the individual transcriptomes but not in the combined one. I therefore ran the process with CD-HIT again with a higher stringency of 0.98.

This time the genes were present in the final transcriptome making me feel more confident about the alignment process.
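A quick, BLAST-free sanity check for the same idea is to grep the combined fasta headers for a transcript ID you expect to survive clustering. The file and ID below are made-up placeholders, not my actual data:

```shell
# Toy combined fasta with one hypothetical chitinase-like transcript
printf '>chitinase_like_1\nATGGCTAAA\n' > toy_combined.fasta

# Count header lines matching the expected ID prefix
grep -c '^>chitinase' toy_combined.fasta
```

This only confirms a header survived; a local BLAST is still needed to check the sequence itself is intact.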

Transcriptome output

Since the last post I have re-run the assembly with both paired and unpaired trimmed sequences. My new transcriptome is not very different in size but I am yet to determine if there are other variations for downstream analysis.

To obtain some statistics on the Trinity output you can use the N50 check.

#Run N50 script
/usr/local/trinity/2.1.1/util/TrinityStats.pl AP.fasta

with an output like this:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 67231
Total trinity transcripts: 82911
Percent GC: 45.44

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3700
Contig N20: 2863
Contig N30: 2350
Contig N40: 1979
Contig N50: 1654

Median contig length: 585
Average contig: 983.42
Total assembled bases: 81535957
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 3463
Contig N20: 2612
Contig N30: 2127
Contig N40: 1758
Contig N50: 1408

Median contig length: 476
Average contig: 834.77
Total assembled bases: 56122482

Compared to previous assembly:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 67123
Total trinity transcripts: 82776
Percent GC: 45.44

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3687
Contig N20: 2864
Contig N30: 2350
Contig N40: 1978
Contig N50: 1652

Median contig length: 586
Average contig: 983.62
Total assembled bases: 81419959
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 3456
Contig N20: 2612
Contig N30: 2126
Contig N40: 1757
Contig N50: 1407

Median contig length: 476
Average contig: 834.73
Total assembled bases: 56029878
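The N50 figures above can be reproduced by hand: sort the contig lengths in descending order and take the length at which the running total passes half the assembled bases. A minimal awk sketch on a toy fasta, assuming single-line sequences (real Trinity output may wrap sequence lines):

```shell
# Toy fasta with contigs of length 4, 8 and 2 (total 14 bases)
printf '>c1\nAAAA\n>c2\nAAAAAAAA\n>c3\nAA\n' > toy.fasta

# Print each sequence length, sort longest first, then find N50
awk '!/^>/{print length($0)}' toy.fasta | sort -rn | awk '
  {len[NR]=$1; total+=$1}
  END{
    half=total/2; run=0
    for(i=1;i<=NR;i++){run+=len[i]; if(run>=half){print "N50:", len[i]; exit}}
  }'
# For this toy file the N50 is 8
```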

Building the transcriptome

The purpose of many RNAseq studies is to compare the gene expression patterns across different treatments. In order to do this, as I mentioned in the previous post, all the reads from each treatment (from the one plant) need to be combined first.

The Trinity software incorporates other tools besides the assembler itself; for example, Trimmomatic can be run within the Trinity pipeline. To do the trimming and assembly you would run a command like this (all on one line, with no spaces inside the comma-separated file lists):

Trinity --seqType fq --max_memory 24G --CPU 7 --left AP0_R1.fastq,AP1_R1.fastq,AP2_R1.fastq --right AP0_R2.fastq,AP1_R2.fastq,AP2_R2.fastq --trimmomatic --output AP_trinity

This will build all the reads into a complete transcriptome for the plant labelled AP from RNA expressed pre-inoculation, at 24 hours and at 48 hours.

Depending on the success of the sequencing, i.e. the number and length of raw reads, the assembly process will take around 1 to 3 days of wall-time. Below is the PBS script to run for trimmed reads.

#PBS -P (project name)
#PBS -N AP_assembly
#PBS -l nodes=1:ppn=24
#PBS -l walltime=100:00:00
#PBS -l pmem=24gb
#PBS -e ./AP_trinity/AP_assembly.txt
#PBS -M email@sydney.edu.au
#PBS -m abe

# Load modules
module load bowtie
module load java
module load trinity/2.1.1

# Working directory
cd /project/path/to/working/directory   # where all the input files are

# Run trinity
Trinity --seqType fq --max_memory 24G --CPU 24 --left AP0_R1.trimpaired.fastq,AP1_R1.trimpaired.fastq,AP2_R1.trimpaired.fastq --right AP0_R2.trimpaired.fastq,AP1_R2.trimpaired.fastq,AP2_R2.trimpaired.fastq --output AP_trinity

NB: As I said previously, I am learning the process as I go. Actually writing this blog is making me find the mistakes I have made along the way. I just found an error in the way I ran the above Trinity script: it appears that both the paired and unpaired trimmed reads should be combined for the Trinity assembly, but I had only used my paired reads. At this point I am way down the pipeline so this is a big blow! I have decided to continue with my current pipeline as well as re-run the assemblies with both the paired and unpaired reads and then compare the outcomes.

Starting out with your raw reads

You have your data – now what?

Your data will usually come to you from the sequencing facility as masses of short reads in zipped files. The files will be labelled something like this:

AP0_CGATGT_L001_R1_001.fastq.gz

AP0_CGATGT_L002_R1_001.fastq.gz

AP0_CGATGT_L001_R2_001.fastq.gz

AP0_CGATGT_L002_R2_001.fastq.gz

So for this sample, AP0 (plant labelled AP, sampled pre-inoculation, hence '0'), I have paired-end data. R1 is from the forward read and R2 is from the reverse read. The Illumina run was done using two lanes, therefore I have two R1 and two R2 datasets (L001 and L002).

These zipped files are .fastq files. Fasta files are the basic text format for gene sequences: a header line beginning with '>' followed by the sequence on the next line. If they are not too massive they can be opened in Notepad++. Fastq files carry additional information about the quality of each read, which cannot be easily read by humans but allows software to sort through and discard poor reads. More information here:

https://en.wikipedia.org/wiki/FASTQ_format.
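As a quick illustration, each FASTQ record spans exactly four lines, which gives a simple way to count reads. A toy single-read file here, not real data:

```shell
# A single FASTQ record: @header, sequence, '+' separator, quality string
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > example.fastq

# Number of reads = number of lines / 4
echo $(( $(wc -l < example.fastq) / 4 ))
```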

You want all the sample AP0 R1 reads from the two sequencing lanes to be in one file, and the same goes for the R2 reads. So you will need to unzip the files and then combine them, as explained below (in 'Back to the zipped raw data').

Unix and high performance computing

Most of the work involved in RNAseq analysis requires a Unix environment (command line) rather than Windows or graphical user interfaces. To run assembly and alignment software you also really need high performance computing (hpc) access. There are a number of ways to access these such as:

http://nci.org.au/access/getting-access-to-the-national-facility/allocation-schemes/

https://aws.amazon.com/hpc/resources/

https://nectar.org.au/

but as I have access to the university cluster I will be just discussing the methods I have used for Artemis, the University of Sydney hpc.

http://sydney.edu.au/research_support/hpc/access/

On my local Windows operating system I have installed PuTTY software (http://www.putty.org/) which allows me to log in to the hpc remotely using a VPN and operate the software with my data in a project directory allocated to me. I use Filezilla (https://filezilla-project.org/) for transferring much of my data from my local data storage location to the hpc and back.

PBS scripts

When using hpc clusters all jobs need to be scheduled and queued so that appropriate allocation of resources occurs. To run software on your data you submit a PBS (Portable Batch System) script which specifies the software you need to use, the resources you need and the time expected to run the job (wall-time). It is often a bit of guesswork to know how long a job will take, but the manuals for software, always available online, will give some guidance. Here is an example of the information required in a PBS script for Artemis:

# Some useful brief comment on what this script does
#PBS -P RDS-FAC-PROJECT-RW - make sure to replace FAC and PROJECT to suit your project
#PBS -N Job_101 - job name; this is what you will see when you run commands to check the status of your job
#PBS -l nodes=1:ppn=2 - requests 1 node and 2 processors per node. Only request more if your job multi-threads. One node has a maximum of 24 processors, so if you wanted 40 cores you would specify 'nodes=2:ppn=20'
#PBS -l walltime=02:00:00 - maximum run time of your job, in hours:minutes:seconds. This example requests 2 hours. The job is killed when it reaches wall-time, so make sure you request enough.
#PBS -l pmem=4gb - RAM per processor
#PBS -j oe - send the 'standard error' and 'standard output' log files to one file (delete this line to keep them separate)
#PBS -M your.email@sydney.edu.au - email job reports to this email address
#PBS -m abe - along with the above directive, this asks for an email when the job aborts (a), begins (b), and ends (e). 'e' is the most important as it gives a report of the resources used, which can help you decide how much to request for the next run.

# Load modules
module load trinity - job will use the software trinity
module load trimmomatic - job will use the software trimmomatic

# Working directory - a 'cd' command here, pointing to the directory holding your input and output files, lets you run the script from any directory without specifying full path names for the input and output files.

# Run trinity - insert the commands to run the job…

I will provide some of the actual pbs scripts I have used as I go.

Back to the zipped raw data

Once you have your data on the hpc you can unzip using the unix command 'gunzip', as these are gzipped files (gz):

gunzip 'filename'

e.g. gunzip AP0_CGATGT_L001_R1_001.fastq.gz

Your original .gz file will be gone from the directory and replaced with the unzipped version.
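If you want to keep the original archive, you can decompress to standard output instead. A toy file here to demonstrate:

```shell
# Make a small gzipped file to demonstrate
printf 'hello\n' | gzip > demo.txt.gz

# -c writes the decompressed data to stdout, leaving demo.txt.gz in place
gunzip -c demo.txt.gz > demo.txt
cat demo.txt
```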

With your files unzipped you can now join them with the unix command cat and, in my example, rename the new combined file AP0_R1.fastq:

cat AP0_CGATGT_L001_R1_001.fastq AP0_CGATGT_L002_R1_001.fastq > AP0_R1.fastq
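As a quick check that concatenation behaves as expected, here is a toy run with two placeholder one-read lane files (the file names follow the example above; the read contents are made up):

```shell
# Placeholder one-read FASTQ files standing in for the real lane files
printf '@read1\nACGT\n+\nIIII\n' > AP0_CGATGT_L001_R1_001.fastq
printf '@read2\nTTTT\n+\nJJJJ\n' > AP0_CGATGT_L002_R1_001.fastq

# Concatenate both lanes into a single R1 file
cat AP0_CGATGT_L001_R1_001.fastq AP0_CGATGT_L002_R1_001.fastq > AP0_R1.fastq

# Each FASTQ record is 4 lines, so 2 reads = 8 lines
wc -l < AP0_R1.fastq
```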

When you have all your data combined like this you can zip them again and begin processing them. Much of the software for assembly can process zipped files. To zip them:

gzip 'filename'

Next post – trimming the reads

RNA and denovo transcriptomes

About me

We live in incredible times – where the genes expressed at a single moment can be captured, read and studied. I am a PhD student at the Faculty of Agriculture and Environment, University of Sydney. I am interested in the molecular responses of plants to pathogens.

I have had no prior experience running software in a Unix environment but decided that the tools are all out there and perhaps I could have a go. The last few months I have been working through an RNAseq pipeline and thought my failures and successes could help someone else who is starting out. I will blog each step and provide scripts that I used to run the software.

There are many different ways to go about analysing RNA data. I present one approach but welcome friendly criticism, constructive advice or comments. I am really learning on the run and there is a lot of free software available for different tasks – thanks to the many software developers.

The experiment

I am working with Australian Myrtaceae plants and investigating host responses to a newly arrived fungal pathogen, myrtle rust. Below is a scanning electron microscope image of a myrtle rust spore germinating on a eucalypt leaf, made with the help of the Australian Microscopy & Microanalysis Research Facility (http://sydney.edu.au/acmm).

[Image: scanning electron micrograph of a germinating myrtle rust spore]

Just by way of some background, the pathogen has a very extensive Myrtaceae host-range and most Australian natural vegetation communities are composed of Myrtaceae plants. These include eucalypts, paperbark, bottle-brush, tea-tree, lemon-scented myrtle, waxflower and more…. The potential for devastating outbreaks under the right environmental conditions is therefore still a looming threat.

For my experiment I took leaf samples from 4 resistant and 4 susceptible plants at three time points: pre-inoculation, 24 hours post-inoculation and 48 hours post-inoculation. I extracted and cleaned the RNA using commercial kits. I then made libraries of around 300 base pairs for paired-end Illumina sequencing on a HiSeq 2500. Two lanes were run for each sample. The raw data from these runs is where I will begin the bioinformatic explanation in my next post.