
Polishing Canu genome

Having assembled the genome using Canu, it is time to polish it. This can be done with Pilon, using Illumina reads aligned to the genome. This can also be a time- and compute-heavy job (days). For example:

#PBS -l select=1:ncpus=8:mem=160gb
#PBS -l walltime=64:00:00
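
For reference, the whole polishing step looks roughly like this. It is a sketch only: the file names, the Pilon jar location, the memory setting and the thread counts are assumptions rather than our exact commands.

# Align the Illumina reads to the Canu assembly (placeholder file names)
bwa index canu_assembly.fasta
bwa mem -t 8 canu_assembly.fasta reads_R1.fastq.gz reads_R2.fastq.gz | samtools sort -@ 8 -o aligned.sorted.bam -
samtools index aligned.sorted.bam

# Polish with Pilon, giving it most of the memory requested from PBS
java -Xmx150G -jar pilon.jar --genome canu_assembly.fasta --frags aligned.sorted.bam --output polished --threads 8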

Now we are doing the polishing again. This time we aligned the raw subreads.bam files to the genome using pbalign from the SMRT analysis tools. A good explanation was found here after much confusion:

https://www.biostars.org/p/273447/

The command to make an XML dataset (of all the subreads.bam files) for pbalign is below. Note that the subreads.bam files are from the Sequel data (one folder) and the RSII data (another folder):

~/bin/smrtlink/smrtcmds/bin/dataset create --type SubreadSet --name fungi fungi_set.xml ../RSII/*.subreads.bam ../sequel/*.subreads.bam

The main confusing bit is the name 'bam'. Here we have pre-alignment raw bams (subreads.bam). Then, after aligning with pbalign, we have aligned bams. These aligned reads are the ones to use in the polishing with arrow from SMRT-tools.

http://www.pacb.com/support/software-downloads/
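
In outline: align the subread dataset made above to the assembly with pbalign, then run arrow on the aligned bam. A minimal sketch, assuming the assembly file name, output names and thread counts:

# Align the raw subreads (the xml dataset made above) to the assembly
~/bin/smrtlink/smrtcmds/bin/pbalign --nproc 8 fungi_set.xml canu_assembly.fasta aligned_subreads.bam

# Polish with arrow, writing a consensus fasta and a variants gff
~/bin/smrtlink/smrtcmds/bin/arrow -j 8 -r canu_assembly.fasta -o polished.fasta -o variants.gff aligned_subreads.bam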


Promising Canu output

After nearly two and a half months of tweaking and restarting, we have an assembled genome. It came in at 1.37 Gb of assembled bases and 19,771 contigs (a bit high, I guess, but not too bad given coverage of around 21X after trimming and correction), with an NG50 of 134,701.

BUSCO analysis results were also not too bad for a first assembly, using a set of 1,335 conserved genes.

Results:
C:76.7%[S:47.9%,D:28.8%],F:7.6%,M:15.7%,n:1335
1024 Complete BUSCOs (C)
640 Complete and single-copy BUSCOs (S)
384 Complete and duplicated BUSCOs (D)
101 Fragmented BUSCOs (F)
210 Missing BUSCOs (M)
1335 Total BUSCO groups searched
BUSCO analysis done. Total running time: 5979.202003479004 seconds
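
For the record, a BUSCO run of this kind can be launched as below. The lineage directory and file names are assumptions, not a record of our exact command.

run_BUSCO.py -i canu_assembly.fasta -o first_assembly -l ./lineage_odb9 -m genome -c 8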


I think that the duplicated genes, and maybe the fragmented ones too, are due to heterozygosity. This will have inflated the genome size too. This means that with revised assembly parameters, to phase the haplotypes, we might get a more realistic assembly; see the sketch below.
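
For what it is worth, the Canu FAQ suggests options along these lines for keeping divergent haplotypes separate during assembly. This is a sketch only, and the values should be checked against the FAQ for the Canu version in use:

canu -p name -d run-phased genomeSize=700m corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" -pacbio-raw ../RSII/*.fasta ../sequel/*.fasta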

There is more work to do, but I have heard a few bad comments about Canu and do not think they are warranted. The assembly took approximately three weeks once we ironed out the grid options for the different stages.

K-mer analysis and genome size estimation, bridges and Euler

Determining the approximate genome size before you begin expensive PacBio sequencing projects is important.

If Illumina reads are available for the organism, a k-mer analysis can estimate the genome size.

There is an excellent tutorial from the University of Connecticut explaining the step-by-step process:

http://bioinformatics.uconn.edu/genome-size-estimation-tutorial/
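
In outline, the tutorial counts k-mers, plots the k-mer histogram and divides the total k-mer count by the depth of the main coverage peak. A minimal sketch with JELLYFISH, assuming k=21 and placeholder read files:

jellyfish count -C -m 21 -s 1G -t 8 -o reads.jf reads_R1.fastq reads_R2.fastq
jellyfish histo -t 8 reads.jf > reads.histo
# genome size is roughly (total k-mers) / (peak depth); for example,
# 30,000,000,000 k-mers with a peak at 25X suggests a genome near 1.2 Gb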

The seven bridges of Königsberg problem, and Euler's solution to it (1736), is the root of the assembly algorithms for short reads. De Bruijn graph theory developed from Euler's formulation, in which every land mass is a 'node' and every bridge is an 'edge'.

https://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg

In assembling a genome using de Bruijn graph algorithms, the k-mers are the edges and the (k-1)-mer prefixes and suffixes they share are the nodes. Explained very nicely here:

DeBruijnPrimer.pdf
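
As a toy illustration of that relationship, the following shell loop prints each k-mer (edge) with its prefix and suffix (k-1)-mers (nodes); the sequence and k are made up:

# Emit each k-mer (edge) and its two (k-1)-mer endpoints (nodes)
SEQ="ATGGCGTGCA"
K=4
for ((i=0; i<=${#SEQ}-K; i++)); do
  KMER=${SEQ:i:K}
  echo "edge ${KMER}: node ${KMER:0:K-1} -> node ${KMER:1:K-1}"
done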


Further Canu developments

Tweaking the memory and wall-time for the compute cluster MHAP job (the correction and overlapping algorithm) progressed things slightly, but the job still failed inexplicably.

After many trials it appears that the actual project directory did not have the disk space for the very large (several terabytes) intermediate files created by MHAP. It seems a pity that neither the HPC nor the software alerted us to this issue, hence many wasted attempts and several weeks of trying.

We then ran the job within the scratch directory, which has around 200 TB of disk space available, and finally the MHAP job completed successfully and the next stage began. This is Canu's other correction and trimming algorithm, called ovl (it returns alignments). We are now trying to work out why this stage only partially completes. We have upped the memory for this stage to 10 GB from the default 5 GB and it is running again, so I guess we will see in a few days whether it worked.

#PBS -P project directory
#PBS -N canu
#PBS -l select=1:ncpus=2:mem=64GB
#PBS -l walltime=00:30:00
#PBS -e ./canu_error.txt
#PBS -o ./canu_output.txt
#PBS -M email@sydney.edu.au
#PBS -m abe

# Load modules
module load gnuplot
module load perl
module load java
module load canu

# Working directory
cd $PBS_O_WORKDIR

canu -p name -d /scratch/directory/run-1510 genomeSize=1000m gridOptions="-P project directory" gridOptionsCORMHAP="-l mem=60g -l walltime=16:00:00" gridOptionsCOROVL="-l mem=10g" gnuplot="/usr/local/gnuplot/5.0.0/bin/gnuplot" -pacbio-raw ../RSII/*.fasta ../sequel/*.fasta

More problems with Canu assembly

The most recent iteration of the script for running a eukaryotic assembly is below:

#PBS -P RDS-project name
#PBS -N canu
#PBS -l select=2:ncpus=4:mem=64GB
#PBS -l walltime=8:00:00
#PBS -e ./canu_error.txt
#PBS -o ./canu_output.txt
#PBS -M name@sydney.edu.au
#PBS -m abe

# Load modules
module load gnuplot
module load perl
module load java
module load canu

# Working directory
cd $PBS_O_WORKDIR

canu -p name -d run01 genomeSize=700m gridOptions="-P RDS-project name" gnuplot="/usr/local/gnuplot/5.0.0/bin/gnuplot" -pacbio-raw ../RSII/*.fasta ../sequel/*.fasta

This time there appears to be a problem with MHAP. Results so far:

Parameters:

-- genomeSize 700000000

-- Overlap Generation Limits:
-- corOvlErrorRate 0.2400 ( 24.00%)
-- obtOvlErrorRate 0.0450 ( 4.50%)
-- utgOvlErrorRate 0.0450 ( 4.50%)

-- Overlap Processing Limits:
-- corErrorRate 0.3000 ( 30.00%)
-- obtErrorRate 0.0450 ( 4.50%)
-- utgErrorRate 0.0450 ( 4.50%)
-- cnsErrorRate 0.0750 ( 7.50%)


-- BEGIN CORRECTION

-- Meryl finished successfully.
-- Finished stage 'merylCheck', reset canuIteration.

-- 16-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 309894793 ******************************************************************** 0.1680 0.0086
-- 2- 2 239928878 ***************************************************** 0.2980 0.0219
-- 3- 4 315390897 ********************************************************************** 0.3951 0.0368
-- 5- 7 264828389 ********************************************************** 0.5270 0.0668
-- 8- 11 195113457 ******************************************* 0.6451 0.1083
-- 12- 16 138851914 ****************************** 0.7369 0.1569
-- 17- 22 97877726 ********************* 0.8044 0.2080
-- 23- 29 69227532 *************** 0.8531 0.2585
-- 30- 37 49523175 ********** 0.8882 0.3064
-- 38- 46 35991767 ******* 0.9135 0.3508
-- 47- 56 26584848 ***** 0.9321 0.3915
-- 57- 67 19941878 **** 0.9460 0.4285
-- 68- 79 15186787 *** 0.9565 0.4620
-- 80- 92 11719981 ** 0.9644 0.4923
-- 93- 106 9169068 ** 0.9706 0.5198
-- 107- 121 7265393 * 0.9755 0.5447
-- 122- 137 5816829 * 0.9794 0.5673
-- 138- 154 4710287 * 0.9825 0.5880
-- 155- 172 3852062 0.9850 0.6068
-- 173- 191 3168774 0.9870 0.6241
-- 192- 211 2637308 0.9887 0.6399
-- 212- 232 2210159 0.9901 0.6546
-- 233- 254 1863344 0.9913 0.6681
-- 255- 277 1581889 0.9923 0.6806
-- 278- 301 1349585 0.9932 0.6921
-- 302- 326 1160541 0.9939 0.7029
-- 327- 352 1005750 0.9945 0.7130
-- 353- 379 875545 0.9951 0.7224
-- 380- 407 769098 0.9955 0.7312
-- 408- 436 680141 0.9960 0.7396
-- 437- 466 602232 0.9963 0.7475
-- 467- 497 535033 0.9966 0.7550
-- 498- 529 474693 0.9969 0.7622
-- 530- 562 422292 0.9972 0.7689
-- 563- 596 377700 0.9974 0.7753
-- 597- 631 337614 0.9976 0.7813
-- 632- 667 303565 0.9978 0.7871
-- 668- 704 273930 0.9980 0.7925
-- 705- 742 249436 0.9981 0.7977
-- 743- 781 226781 0.9983 0.8027
-- 782- 821 205517 0.9984 0.8075

-- 16825801 (max occurrences)
-- 35749081594 (total mers, non-unique)
-- 1535091669 (distinct mers, non-unique)
-- 309894793 (unique mers)
-- For mhap overlapping, set repeat k-mer threshold to 360589.

-- Found 36058976387 16-mers; 1844986462 distinct and 309894793 unique. Largest count 16825801.
-- Finished stage 'cor-meryl', reset canuIteration.

-- OVERLAPPER (mhap) (correction)

-- Set corMhapSensitivity=normal based on read coverage of 51.

-- PARAMETERS: hashes=512, minMatches=3, threshold=0.78

-- Given 30 GB, can fit 90000 reads per block.
-- For 31 blocks, set stride to 7 blocks.
-- Logging partitioning to 'correction/1-overlapper/partitioning.log'.
-- Configured 30 mhap precompute jobs.
-- Configured 76 mhap overlap jobs.
-- Finished stage 'cor-mhapConfigure', reset canuIteration.

-- Running jobs. First attempt out of 2.

-- 'precompute.jobSubmit-01.sh' -> job 1630049[].pbsserver tasks 1-30.

----------------------------------------
-- Starting command on Wed Sep 20 15:10:37 2017 with 7263.768 GB free disk space

cd /projectdirectory
qsub \
-j oe \
-W depend=afterany:1630049[].pbsserver \
-l mem=4g \
-l nodes=1:ppn=1 \
-P 'project name' \
-N 'canu_name' \
-o canu-scripts/canu.02.out canu-scripts/canu.02.sh
1630050.pbsserver

-- Finished on Wed Sep 20 15:10:37 2017 (lickety-split) with 7263.768 GB free disk space

Still trying to work out what to do now…

Assembly with Canu

We have begun testing the extracted .fasta files from both RSII and Sequel using the Canu assembly software.

The latest version of Canu is v1.6 (released on 15 August 2017):

https://github.com/marbl/canu/releases

and the associated publication is Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM (2017) "Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation", Genome Research 27:722-736.

We installed it locally on the login node of the university HPC, as the currently installed version on the HPC is v1.3.

To run it we used these specifications:

#PBS -l select=1:ncpus=24:mem=128GB+8:ncpus=4:mem=32GB
#PBS -l walltime=48:00:00

# Load modules
module load perl
module load java

# Working directory
cd /project/…

/home/canu/canu-1.6/Linux-amd64/bin/canu -p name genomeSize=700m -pacbio-raw ../RSII/*.fasta ../sequel/*.fasta

Now it is just a matter of waiting to see how it goes. We probably need a few more parameters, but this will be the first test.

False starts with PacBio reads

As mentioned previously, the raw reads from the PacBio RSII and Sequel come as bax.h5 and subreads.bam files respectively. Both the Falcon and Canu assemblers take FASTA or FASTQ input files.

After reading a bit more we realized that we needed to extract the FASTA files, as well as the arrow files, for assembly and then polishing. dextract from the Dazzler suite is the software to extract and make these files from the bax.h5 and subreads.bam files.

The following explanation is what happened when we attempted to install dextract, via FALCON-integrate, on the HPC.

We found DEXTRACTOR was not on the HPC. We thought it might have been installed as part of FALCON, which is there, but this was not the case. We then decided to install FALCON-integrate, which includes DEXTRACTOR. We followed the instructions here:
export GIT_SYM_CACHE_DIR=~/.git-sym-cache # to speed things up
cd FALCON-integrate
git checkout master # or whatever version you want
git submodule update --init # Note: You must do this yourself! No longer via `make init`.
make init
source env.sh
make config-edit-user
make -j all
make test # to run a simple one
We needed to run "module load python" to get the correct Python version (2.7.9).
We found that dextract did not build. This was fixed by changing the hdf5 include and lib paths in the makefile (DEXTRACTOR/Makefile).
We then found that FALCON-integrate does not include the later version of DEXTRACTOR, which can work with .bam files, so we decided to do a stand-alone install of the latest DEXTRACTOR:
cd DEXTRACTOR
git checkout master
make -f Makefile
The makefile needed to be edited to work with the different zlib and hdf5 include and lib paths (possibly this could have been avoided by running "module load hdf5/1.8.16" and "module load zlib" beforehand).

Running dextract

Run these beforehand:
module load hdf5/1.8.16
module load zlib
Example:
~/bin/DEXTRACTOR/dextract -f -a m54078_170627_071131.subreads.bam
This generates m54078_170627_071131.fasta and m54078_170627_071131.arrow in the current directory:
-f outputs a FASTA file
-a outputs an arrow file
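
To convert a whole folder of Sequel output in one go, a loop along the following lines works (the paths are assumptions):

module load hdf5/1.8.16
module load zlib
for BAM in ../sequel/*.subreads.bam; do
    ~/bin/DEXTRACTOR/dextract -f -a "$BAM"
done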