Long read assembly planning

Okay so it has taken me some time to decipher the methods for assembling long read data for larger diploid, heterozygote organisms. I am starting to see through the fog.

It seems there are pretty much two assembler options for large genomes at present; Canu and Falcon.

http://canu.readthedocs.io/en/latest/quick-start.html

https://github.com/PacificBiosciences/FALCON/wiki/Setup%3A-Running

Both take input of .fasta files or (fastq), either zipped or unzipped.

The read data from PacBio sequencers is in .bam or .h5 format, as mentioned in the previous post, with .bam being the more standardized file format moving forward. The subread.bam files hold data from the first analysis (ie. raw read data from the sequencer that has had the adaptors removed). Whereas the scraps.bam files have error reads and adaptors.

The Dazzler blog, by Gene Myers from the Max Planck Institute for Molecular Cell Biology and Genetics,  is worth a read.

https://dazzlerblog.wordpress.com/2014/06/01/the-dazzler-db/

They have developed software designed to extract the sequence data and read quality from subread.bam files. This outputs two files; a fasta file and an arrow file (formerly quiver). This greatly reduces the file size for assembly (from 100% to 4% for the fasta). The assembly can then proceed with the fasta input files and the error correction can be done later using the arrow file. Also a database is formed that compresses the data in a format that can easily recreate the raw data files. I am greatly simplifying the process but this is the essential idea, I think.

I am yet to undertake any work on my data but now I have some idea where I am heading.

RS II reads and Sequel reads

Read outputs from these two generations of PacBio sequencers are different. This seems to be because they have adopted a recognised standard format for read files as .bam. So the RSII files include this format (bax.h5). An explanation of what the numbers/letters in each file name mean is available here;

https://github.com/PacificBiosciences/SMRT-Analysis/wiki/Data-files-you-received-from-your-service-provider

and a subset reposted here.

PacBio File Naming Convention

Many files and sequences used in primary and secondary analysis contain a long string identifier. Below is a typical ID string from a subread with fields defined:

 m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230
└1┘└────2─────┘ └─3─┘ └────────────────4────────────────┘└5┘└6┘└7┘ └───8────┘
  1. m” = movie
  2. Time of Run Start (yymmdd_hhmmss)
  3. Instrument Serial Number
  4. SMRT Cell Barcode
  5. Set Number (a.k.a. “Look Number”. Deprecated field, used in earlier version of RS)
  6. Part Number (usually “p0“, “X0” when using expired reagents)
  7. ZMW hole number †
  8. Subread Region (start_stop using polymerase read coordinates) †

I have reads from both PacBio sequencers and therefore need to standardise to bam to assemble them all together. To convert the .h5 files to bam – you can use the code found here;

https://github.com/PacificBiosciences/FALCON/wiki/Convert-RSII-to-BAM

 

† Note that Fields 7 and 8 are used as sequence IDs in FASTA|FASTQ files. They are not used in filenames.

Starting the process of genome assembly with long reads

I have struggled to find a clean pipeline for de novo assembling long read data from large eukaryotes. I have reads from PacBio RS II as well as Sequel. RS II read files are .h5 and sequel files are .bam. An explanation of the bam format is here, it is not just an alignment format:

http://pacbiofileformats.readthedocs.io/en/5.0/BAM.html

At this stage I am planning to convert .h5 files to .bam and then use Falcon to assemble (hoping the coverge is enough) and then map short reads (Illumina) to the contigs – perhaps with Pilon? Early days so we will see what eventuates along the way. I will post script as I go.

https://github.com/PacificBiosciences/FALCON/wiki/Manual

https://github.com/broadinstitute/pilon/wiki