Hi,
I have a set of ~ 50 corrected PacBio reads (~2KB) which I want to map against a reference genome. These 50 reads should in theory represent just 50 individual locations on a genome. After that I want to extract variation between the individual read and the reference (so: depth = 1). I was wondering if I could use bwa-mem for mapping, but I don't have an idea of what tool I could use for long-read genotyping (format should be VCF). Of course I could also extract the variation based on the SAM file and convert it myself to a VCF format but I hope there's an easier solution ;)
Anyone who can help me?
Thanks!
How does BBMap handle illumina merged reads (PE reads that overlap) that are around 300bp? Have you tested?
BBMap handles merged reads fine. It's used routinely to map merged 2x150bp (270bp insert) reads to Ray assemblies, after using BBMerge to merge them. BBMap is highly tolerant of indels so even if they are merged with the wrong insert size (which happens a lot of you use Flash, due to its high false-positive rate) BBMap will still be able to map them.
For long PacBio reads you need to use
mapPacBio.sh
which is designed for a PacBio error profile and reads up to 6kbp; for merged Illumina reads, the appropriate program is "bbmap.sh" which is designed for an Illumina error profile and reads up to 600bp. Each calls a different version of BBMap.Thanks for your tip. By corrected I meanth that mutiple raw reads (~300X) per target have been aligned to a single consensus sequence. Although you don't recommend calling variants with depth 1, I do need this in a VCF format. Any recommendations of which tool to use?
Is this genome haploid, diploid, or higher?
As for tools - I suggest samtools mpileup, in general, but you may need to play with the parameters because many variant-callers will ignore things with depth-1 by default. But calling from 1x-depth consensus reads is only realistic for a haploid genome.
Thanks for the suggestion, I'll try that. As with the consensus reads: it does give back haplotypes, since it is indeed a diploid genome.
Late comment, because I am curious. I do not think samtools mpileup colud call structure variation right/ also do BBmap able to call SVs from long reads?
BBMap's CallVariants can call insertions (shorter than read length) and long deletions (very long) using long reads. It does not do other kinds of SV-calling though, like junctions, rearrangements, inversions, and CNV's, unless they can be represented as indels and substitutions.