As mentioned previously, raw reads from the PacBio RSII and Sequel come as bax.h5 and subreads.bam files respectively. Both the Falcon and Canu assemblers take FASTA or FASTQ files as input.
After reading a bit more, we realized that we needed to extract both the fasta and the arrow files, for assembly and then polishing. Dextract, from the Dazzler suite, is the software that extracts and generates these files from the bax.h5 and subreads.bam files.
The following describes what happened when we attempted to install dextract on the HPC via FALCON-integrate.
We found that DEXTRACTOR was not on the HPC. We thought it might have been installed as part of FALCON, which is there, but this was not the case. We then decided to install FALCON-integrate, which includes DEXTRACTOR. We followed the instructions here:
export GIT_SYM_CACHE_DIR=~/.git-sym-cache # to speed things up
git checkout master # or whatever version you want
git submodule update --init # Note: You must do this yourself! No longer via `make init`.
make -j all
make test # to run a simple one
We needed to run "module load python" first to get the correct Python version (2.7.9).
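A quick check of which interpreter is on the PATH can save a failed build. The sketch below is only illustrative: the helper names `is_python27` and `check_python` are made up for this example, and it assumes the interpreter is invoked as `python`.

```shell
# Hypothetical helper: succeed if a `python --version` string reports 2.7.x.
is_python27() {
    case "$1" in
        "Python 2.7."*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: report the interpreter currently on PATH.
check_python() {
    local ver
    # Python 2 prints its version on stderr, so merge the two streams.
    ver=$(python --version 2>&1) || ver="python not found"
    if is_python27 "$ver"; then
        echo "ok: $ver"
    else
        echo "unexpected: $ver (try: module load python)"
    fi
}
```

Calling `check_python` before `make -j all` shows whether the module still needs to be loaded.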
We found that dextract did not build. This was fixed by changing the hdf5 include and lib paths in the makefile (DEXTRACTOR/Makefile).
We then found that FALCON-integrate doesn't include the later version of DEXTRACTOR, which can work with .bam files, so we decided to do a standalone install of the latest DEXTRACTOR.
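To see which lines of the makefile need changing, it can help to list every line that mentions hdf5. This is a minimal sketch using plain grep; the function name `hdf5_lines` is made up for this example, and the actual variable names in the makefile may differ between DEXTRACTOR versions.

```shell
# Hypothetical helper: list the lines of a makefile that mention hdf5
# (case-insensitive), prefixed with their line numbers.
hdf5_lines() {
    grep -n -i 'hdf5' "$1"
}
```

For example, `hdf5_lines DEXTRACTOR/Makefile` prints the candidate lines to edit so that the include and lib paths point at the HPC's hdf5 install.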
git checkout master
make -f Makefile
The makefile needed to be edited to use different zlib and hdf5 include and lib paths
(this could possibly have been avoided by running "module load hdf5/1.8.16" and "module load zlib" beforehand).
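Whether loading the modules is enough depends on how the site's modules expose their headers. The sketch below assumes, without having checked this HPC, that the modules add their include directories to CPATH; `find_header` and `preflight` are names made up for this example.

```shell
# Hypothetical helper: search a colon-separated list of directories for a
# header file. Prints the first match and returns 0, or returns 1 if none.
find_header() {
    local header dirs dir old_ifs
    header=$1
    dirs=$2
    old_ifs=$IFS
    IFS=:
    for dir in $dirs; do
        if [ -n "$dir" ] && [ -f "$dir/$header" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$header"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Assumption: modules export include directories through CPATH.
# /usr/include is added as a fallback for system-wide installs.
preflight() {
    local h
    for h in hdf5.h zlib.h; do
        find_header "$h" "${CPATH:-}:/usr/include" || echo "missing: $h"
    done
}
```

Running `preflight` after the two `module load` commands prints where each header was found, or `missing: ...` if the makefile paths would still need editing.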
To run dextract, first load the modules and then call it on the subreads file:
module load hdf5/1.8.16
module load zlib
~/bin/DEXTRACTOR/dextract -f -a m54078_170627_071131.subreads.bam
This generates m54078_170627_071131.fasta and m54078_170627_071131.arrow in the current directory.
-f outputs a fasta file
-a outputs an arrow file
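With several Sequel runs, the same command can be wrapped in a loop. The following is a sketch only: `expected_outputs` and `extract_all` are names made up for this example, the install path is taken from the command above, and the output naming is assumed to follow the single example shown (the `.subreads.bam` suffix replaced by `.fasta` and `.arrow`).

```shell
# Assumed install location, taken from the command above.
DEXTRACT=${DEXTRACT:-$HOME/bin/DEXTRACTOR/dextract}

# Hypothetical helper: names of the files dextract is expected to write
# (in the current directory) for a given subreads.bam input.
expected_outputs() {
    local bam base
    bam=$(basename "$1")
    base=${bam%.subreads.bam}
    printf '%s.fasta %s.arrow\n' "$base" "$base"
}

# Hypothetical helper: run dextract with -f and -a on every file given,
# stopping at the first failure.
extract_all() {
    local bam
    for bam in "$@"; do
        echo "extracting $bam -> $(expected_outputs "$bam")"
        "$DEXTRACT" -f -a "$bam" || return 1
    done
}
```

After loading the hdf5 and zlib modules, `extract_all *.subreads.bam` would process every subreads file in the current directory.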