K-mer analysis and genome size estimation, bridges and Euler

Determining the approximate genome size before you begin expensive PacBio sequencing projects is important.

If Illumina reads are available for the organism, a kmer analysis can estimate the genome size.

There is an excellent tutorial from the University of Connecticut explaining the step by step process:

http://bioinformatics.uconn.edu/genome-size-estimation-tutorial/

The seven bridges of Königsberg problem, and Euler’s solution (1736), is the root of the assembly algorithms for short reads. De Bruijn graph theory developed from Euler’s solution that every land mass is a ‘node’ and every bridge is an ‘edge’.

https://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg

In assembling a genome using de Bruijn graph algorithms, the edges are kmers and the nodes are alignments. Explained very nicely here:

Click to access DeBruijnPrimer.pdf

 

 

Further canu developments

While tweaking the memory and wall-time for the compute cluster Mhap job (correction and overlapping algorithm) progressed things slightly – the job still failed inexplicably.

After many trials it appears that the actual project directory did not have the disk space capacity for the very large (several terabytes) intermediate files created by Mhap. It seems a pity that neither the HPC nor the software alerted us to this issue – hence many wasted attempts and several weeks of trying.

We then ran the job within scratch directory, which has around 200T disk space available, and finally the Mhap job completed successfully and the next stage began. This is the other correction and trimming algorithm of Canu and is called Ovl (returns alignments). Now trying to work out why this section only partially completes. We have upped the memory on this stage to 10G from the default 5G and it is now running – so I guess we will see in a few days if it worked.

#PBS -P project directory
#PBS -N canu
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=00:30:00
#PBS -e ./canu_error.txt
#PBS -o ./canu_output.txt
#PBS -M email@sydney.edu.au
#PBS -m abe

# Load modules
module load gnuplot
module load perl
module load java
module load canu

# Working directory
cd $PBS_O_WORKDIR

canu -p name -d /scratch/directory/run-1510 genomeSize=1000m gridOptions=”-P project directory” gridOptionsCORMHAP=”-l mem=60g -l walltime=16:00:00″ gridOptionsCOROVL=”-l mem=10g” gnuplot=”/usr/local/gnuplot/5.0.0/bin/gnuplot” -pacbio-raw ../RSII/*.fasta ../sequel/*.fasta

More problems with Canu assembly

The most recent iteration of scripts for running a eukaryotic assembly is below:

#PBS -P RDS-project name
#PBS -N canu
#PBS -l select=2:ncpus=4:mem=64GB
#PBS -l walltime=8:00:00
#PBS -e ./canu_error.txt
#PBS -o ./canu_output.txt
#PBS -M name@sydney.edu.au
#PBS -m abe

# Load modules
module load gnuplot
module load perl
module load java
module load canu

# Working directory
cd $PBS_O_WORKDIR

canu -p name -d run01 genomeSize=700m gridOptions=”-P RDS-project name” gnuplot=”/usr/local/gnuplot/5.0.0/bin/gnuplot” -pacbio-raw ../RSII/*.fasta ../sequel/*.fasta

This time there appears to be a problem with MHap. Results so far:

Parameters:

— genomeSize 700000000

— Overlap Generation Limits:
— corOvlErrorRate 0.2400 ( 24.00%)
— obtOvlErrorRate 0.0450 ( 4.50%)
— utgOvlErrorRate 0.0450 ( 4.50%)

— Overlap Processing Limits:
— corErrorRate 0.3000 ( 30.00%)
— obtErrorRate 0.0450 ( 4.50%)
— utgErrorRate 0.0450 ( 4.50%)
— cnsErrorRate 0.0750 ( 7.50%)


— BEGIN CORRECTION

— Meryl finished successfully.
— Finished stage ‘merylCheck’, reset canuIteration.

— 16-mers Fraction
— Occurrences NumMers Unique Total
— 1- 1 309894793 ******************************************************************** 0.1680 0.0086
— 2- 2 239928878 ***************************************************** 0.2980 0.0219
— 3- 4 315390897 ********************************************************************** 0.3951 0.0368
— 5- 7 264828389 ********************************************************** 0.5270 0.0668
— 8- 11 195113457 ******************************************* 0.6451 0.1083
— 12- 16 138851914 ****************************** 0.7369 0.1569
— 17- 22 97877726 ********************* 0.8044 0.2080
— 23- 29 69227532 *************** 0.8531 0.2585
— 30- 37 49523175 ********** 0.8882 0.3064
— 38- 46 35991767 ******* 0.9135 0.3508
— 47- 56 26584848 ***** 0.9321 0.3915
— 57- 67 19941878 **** 0.9460 0.4285
— 68- 79 15186787 *** 0.9565 0.4620
— 80- 92 11719981 ** 0.9644 0.4923
— 93- 106 9169068 ** 0.9706 0.5198
— 107- 121 7265393 * 0.9755 0.5447
— 122- 137 5816829 * 0.9794 0.5673
— 138- 154 4710287 * 0.9825 0.5880
— 155- 172 3852062 0.9850 0.6068
— 173- 191 3168774 0.9870 0.6241
— 192- 211 2637308 0.9887 0.6399
— 212- 232 2210159 0.9901 0.6546
— 233- 254 1863344 0.9913 0.6681
— 255- 277 1581889 0.9923 0.6806
— 278- 301 1349585 0.9932 0.6921
— 302- 326 1160541 0.9939 0.7029
— 327- 352 1005750 0.9945 0.7130
— 353- 379 875545 0.9951 0.7224
— 380- 407 769098 0.9955 0.7312
— 408- 436 680141 0.9960 0.7396
— 437- 466 602232 0.9963 0.7475
— 467- 497 535033 0.9966 0.7550
— 498- 529 474693 0.9969 0.7622
— 530- 562 422292 0.9972 0.7689
— 563- 596 377700 0.9974 0.7753
— 597- 631 337614 0.9976 0.7813
— 632- 667 303565 0.9978 0.7871
— 668- 704 273930 0.9980 0.7925
— 705- 742 249436 0.9981 0.7977
— 743- 781 226781 0.9983 0.8027
— 782- 821 205517 0.9984 0.8075

— 16825801 (max occurrences)
— 35749081594 (total mers, non-unique)
— 1535091669 (distinct mers, non-unique)
— 309894793 (unique mers)
— For mhap overlapping, set repeat k-mer threshold to 360589.

— Found 36058976387 16-mers; 1844986462 distinct and 309894793 unique. Largest count 16825801.
— Finished stage ‘cor-meryl’, reset canuIteration.

— OVERLAPPER (mhap) (correction)

— Set corMhapSensitivity=normal based on read coverage of 51.

— PARAMETERS: hashes=512, minMatches=3, threshold=0.78

— Given 30 GB, can fit 90000 reads per block.
— For 31 blocks, set stride to 7 blocks.
— Logging partitioning to ‘correction/1-overlapper/partitioning.log’.
— Configured 30 mhap precompute jobs.
— Configured 76 mhap overlap jobs.
— Finished stage ‘cor-mhapConfigure’, reset canuIteration.

— Running jobs. First attempt out of 2.

— ‘precompute.jobSubmit-01.sh’ -> job 1630049[].pbsserver tasks 1-30.

—————————————-
— Starting command on Wed Sep 20 15:10:37 2017 with 7263.768 GB free disk space

cd /projectdirectory
qsub \
-j oe \
-W depend=afterany:1630049[].pbsserver \
-l mem=4g \
-l nodes=1:ppn=1 \
-P project name
-N ‘canu_name’ \
-o canu-scripts/canu.02.out canu-scripts/canu.02.sh
1630050.pbsserver

— Finished on Wed Sep 20 15:10:37 2017 (lickety-split) with 7263.768 GB free disk space

Still trying to work out what to do now…