Canu unitigging/1-overlapper

Using Canu v1.6 this stage has taken 2 months so far with 351 jobs. There are 28 jobs still to run on the University of Sydney HPC, Artemis. Each job requires around 10GB memory, 4 cpus, and a progressive amount of walltime. Starting from only about 5 hours to about 150 hours of walltime for the last 100 jobs.

I have detailed this previously but the genome is predicted to be 1.3 or 1.4 Gbp and we started with 50-60X coverage of PacBio Sequel and RSII data.

The trimmedReads.fasta.gz output was 12G

canu script (NB the genome size was set to 1G for the script):

canu -p MR_1805a \
-d /scratch/directory/run-MR_1805a \
gnuplot=”/usr/local/gnuplot/5.0.0/bin/gnuplot” \
genomeSize=1g \
correctedErrorRate=0.040 \
batOptions=”-dg 3 -db 3 -dr 1 -ca 500 -cp 50″ \

A previous assembly was begun 6 months ago using the same raw data and is still incomplete. The same script was run except for an additional  parameter:

corOutCoverage=200

The trimmedReads.fasta.gz output from that assembly run was 17G

 

 

Making blast databases from fasta

makeblastdb -dbtype prot -in sequence_name.fasta

This script will return three files:

sequence_name.fasta.phr

sequence_name.fasta.pin

sequence_name.fasta.psq

You can then blastx your raw DNA reads against this database to pull out matches (with a custom script):

/project/directory/blast/launch_blast.py -1 RawReads_R1.fq -2 RawReads_R2.fq -b blastx -d /project/directory/blast/sequence_name.fasta

– before extracting matches based on significance (e-value, length, score). The extracted reads are then used to assemble the gene of interest ‘de novo’.

However, I recently tried running a script for blastx against newly made db files and kept returning the error:

BLAST Database error: No alias or index file found for protein database [/project/directory/sequence.fasta] in search path [/project/directory/::]

I was confused as this has been working up previously. After much wasted time I realised that I had changed my script to include the database name to:

makeblastdb -dbtype prot -in sequence_name.fasta -out sequence_name

This output db files named

sequence_name.phr

sequence_name.pin

sequence_name.psq

NB: The name of the database files did not include the .fasta

Hence blastx could not find my files.