Question

Long read-alignment + variant calling

3

Entering edit mode

10.2 years ago

Coryza ▴ 430

Hi,

I have a set of ~ 50 corrected PacBio reads (~2KB) which I want to map against a reference genome. These 50 reads should in theory represent just 50 individual locations on a genome. After that I want to extract variation between the individual read and the reference (so: depth = 1). I was wondering if I could use bwa-mem for mapping, but I don't have an idea of what tool I could use for long-read genotyping (format should be VCF). Of course I could also extract the variation based on the SAM file and convert it myself to a VCF format but I hope there's an easier solution ;)

Anyone who can help me?

Thanks!

long-reads mapping genotyping • 5.7k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by Coryza ▴ 430

Ram · Answer 1 · 2015-02-13

1

Entering edit mode

10.2 years ago

zhan2142 ▴ 10

I use BLASR to map PacBio reads to the reference.

And I also tried to run freeBayes to identify the variant. That was a disaster. I am not saying it can't identify SNPs, but it took way long to finish. Not recommended.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by zhan2142 ▴ 10

Ram · Answer 2 · 2015-02-13

1

Entering edit mode

10.2 years ago

Brian Bushnell 20k

I use BBMap for PacBio reads; it's a lot easier to use then BLASR and yields better alignments. I would not recommend calling variants from 1x-coverage PacBio data, corrected or not. That said, what do you mean by corrected? Reads of Insert, or something else?

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by Brian Bushnell 20k

0

Entering edit mode

How does BBMap handle illumina merged reads (PE reads that overlap) that are around 300bp? Have you tested?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by apelin20 ▴ 490

0

Entering edit mode

BBMap handles merged reads fine. It's used routinely to map merged 2x150bp (270bp insert) reads to Ray assemblies, after using BBMerge to merge them. BBMap is highly tolerant of indels so even if they are merged with the wrong insert size (which happens a lot of you use Flash, due to its high false-positive rate) BBMap will still be able to map them.

For long PacBio reads you need to use mapPacBio.sh which is designed for a PacBio error profile and reads up to 6kbp; for merged Illumina reads, the appropriate program is "bbmap.sh" which is designed for an Illumina error profile and reads up to 600bp. Each calls a different version of BBMap.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks for your tip. By corrected I meanth that mutiple raw reads (~300X) per target have been aligned to a single consensus sequence. Although you don't recommend calling variants with depth 1, I do need this in a VCF format. Any recommendations of which tool to use?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by Coryza ▴ 430

0

Entering edit mode

Is this genome haploid, diploid, or higher?

As for tools - I suggest samtools mpileup, in general, but you may need to play with the parameters because many variant-callers will ignore things with depth-1 by default. But calling from 1x-depth consensus reads is only realistic for a haploid genome.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks for the suggestion, I'll try that. As with the consensus reads: it does give back haplotypes, since it is indeed a diploid genome.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by Coryza ▴ 430

0

Entering edit mode

Late comment, because I am curious. I do not think samtools mpileup colud call structure variation right/ also do BBmap able to call SVs from long reads?

ADD REPLY • link 7.6 years ago by Medhat 9.8k

0

Entering edit mode

BBMap's CallVariants can call insertions (shorter than read length) and long deletions (very long) using long reads. It does not do other kinds of SV-calling though, like junctions, rearrangements, inversions, and CNV's, unless they can be represented as indels and substitutions.

ADD REPLY • link 7.6 years ago by Brian Bushnell 20k