Using Canu v1.6, this stage has taken two months so far with 351 jobs, and 28 jobs are still to run on the University of Sydney HPC, Artemis. Each job requires around 10 GB of memory, 4 CPUs, and a progressively increasing amount of walltime: from only about 5 hours up to about 150 hours for the last 100 jobs.
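As a rough illustration only, a resource request like the one above might look like the following PBS Pro job script. The project code, queue name, and module name are placeholders, and Artemis's exact scheduler configuration may differ; Canu also submits its own grid jobs, so this is just the shape of a single job's request:

```shell
#!/bin/bash
# Hypothetical PBS Pro job script matching the resources described above.
#PBS -P PROJECT_CODE                # placeholder project code
#PBS -l select=1:ncpus=4:mem=10GB   # 4 CPUs, ~10 GB memory per job
#PBS -l walltime=150:00:00          # upper end of the observed range
#PBS -q defaultQ                    # queue name is an assumption

cd "$PBS_O_WORKDIR"
module load canu/1.6                # module name is an assumption
```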
I have detailed this previously, but the genome is predicted to be 1.3–1.4 Gbp and we started with 50–60X coverage of PacBio Sequel and RSII data.
The trimmedReads.fasta.gz output was 12 GB.
The canu script (NB: the genome size was set to 1 Gbp for the script):
canu -p MR_1805a \
-d /scratch/directory/run-MR_1805a \
batOptions="-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \
A previous assembly was begun 6 months ago using the same raw data and is still incomplete. The same script was run except for an additional parameter:
The trimmedReads.fasta.gz output from that assembly run was 17 GB.
makeblastdb -dbtype prot -in sequence_name.fasta
This command will return three files: sequence_name.fasta.phr, sequence_name.fasta.pin and sequence_name.fasta.psq.
You can then blastx your raw DNA reads against this database to pull out matches (with a custom script):
/project/directory/blast/launch_blast.py -1 RawReads_R1.fq -2 RawReads_R2.fq -b blastx -d /project/directory/blast/sequence_name.fasta
You can then extract matches based on significance (e-value, alignment length, score) and use the extracted reads to assemble the gene of interest de novo.
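As a sketch of the filtering step, assuming the blastx run produced tabular output (-outfmt 6), hits can be filtered on e-value and alignment length with awk. The column positions are those of the default outfmt 6 layout (alignment length in column 4, e-value in column 11), and the thresholds and file names are arbitrary examples:

```shell
# Toy BLAST tabular output (-outfmt 6): three example rows.
printf 'read1\tgene\t99.0\t60\t0\t0\t1\t60\t1\t60\t1e-20\t120\n' >  blast_hits.tsv
printf 'read2\tgene\t95.0\t30\t0\t0\t1\t30\t1\t30\t1e-20\t60\n'  >> blast_hits.tsv
printf 'read3\tgene\t90.0\t70\t0\t0\t1\t70\t1\t70\t0.5\t40\n'    >> blast_hits.tsv

# Keep hits with e-value < 1e-10 AND alignment length >= 50 (example
# thresholds), and print the unique read IDs (column 1) for extraction.
awk '$11 < 1e-10 && $4 >= 50 {print $1}' blast_hits.tsv | sort -u > keep_ids.txt
cat keep_ids.txt
# prints: read1
```

The surviving IDs in keep_ids.txt can then be used to pull the corresponding reads out of the FASTQ files for the de novo assembly step.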
However, I recently tried running a script for blastx against newly made db files and kept getting the error:
BLAST Database error: No alias or index file found for protein database [/project/directory/sequence.fasta] in search path [/project/directory/::]
I was confused, as this had been working previously. After much wasted time I realised that I had changed my script to specify a database name:
makeblastdb -dbtype prot -in sequence_name.fasta -out sequence_name
This output db files named sequence_name.phr, sequence_name.pin and sequence_name.psq.
NB: the names of the database files did not include .fasta.
Hence blastx, which was still pointed at the .fasta name, could not find my files.
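The mismatch can be illustrated without running BLAST at all: blastx resolves -db NAME by looking for NAME.phr (and the other index files), so a database built with -out sequence_name can never be found under the name sequence_name.fasta. A minimal sketch, with touch standing in for the files makeblastdb actually writes:

```shell
# Stand-ins for the files 'makeblastdb ... -out sequence_name' would create.
touch sequence_name.phr sequence_name.pin sequence_name.psq

db="sequence_name.fasta"   # the (wrong) name passed to blastx -db
# blastx looks for <db>.phr etc.; with the mismatched name nothing is found.
if [ -e "${db}.phr" ]; then
    echo "database found"
else
    echo "No alias or index file found for ${db}"
fi
# prints: No alias or index file found for sequence_name.fasta
```

The fix is simply to keep the two names consistent: either drop -out so the database files inherit the full input file name (including .fasta), or pass the same -out value to blastx -db.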