complex indel precludes detection of overlapping SNPs
0
0
Entering edit mode
7.8 years ago

[Disclaimer: I cross-posted on the FreeBayes Google group but it doesn't seem to get much activity]

I'm trying to call low-frequency variants in a pooled sample (C. elegans WGS, 20X coverage, aligned with BBMap using 'sam=1.3' flag ) and encountered an unexpected problem - which usually means I'm doing something wrong :-). Here's the command:

freebayes -f REFERENCE.FA -F 0.01 -C 1 DATA_SORTED.BAM > DATA.VCF

The VCF contains a number of long (e.g., 10K) complex deletions, which are supported by a few (2-3) lower map-quality (2-18) reads. The deletions overlap SNPs that are supported by more (8-12) higher quality (>20) reads. The problem is that the complex indel is called but not the SNPs, even though the data provide superior support for the SNP call. One example using SAMtools 'tview' is shown below (reads supporting complex deletion in green/blue; asterisk=deletion).

I realize that I can filter on map quality but I'm trying to maximize sensitivity to true-positive SNP calls (which I can validate by prior annotation), and filtering reduces sensitivity. I've tried using the prior annotation VCF (-@ flag) but that didn't help, and the '--no-complex flag' just changed the TYPE to del. And obviously post-VCF filtering cannot recover variant calls that are missing.

Should I be using a different set of criteria/flags to call these overlapping SNPs?

TVIEW_image

variant-calling FreeBayes pooled sample • 2.7k views
ADD COMMENT
0
Entering edit mode

I'd be interested in the results if you try BBMap's variant-caller; it does not care whether some variations are masked by others - they are all called independently. For low-frequency variants, down to 1%, the command would be:

callvariants.sh in=mapped.sam ref=ref.fa vcf=vars.vcf ploidy=2 rarity=0.01

The "rarity=0.01" flag means variants will not have their score penalized based on low allele fraction unless the allele fraction drops below 0.01, so basically, it allows detection of variants down to 1% which would otherwise be discarded. Note that CallVariants requires either mapping with "sam=1.4" or with MD tags ("mdtag=t"); I don't think it works with "sam=1.3" and no MD tags although I'm planning to allow that. Also, if you use it, I suggest the latest version (37.00) since I've made changes recently.

ADD REPLY
0
Entering edit mode

@Brian, I was planning to test BBMap (I'm actually benchmarking different VC tools). It may take a few days - I'll keep you posted. But I know that FreeBayes has been used for low-frequency SNP detection, so I'm hoping someone has worked out the solution already.

ADD REPLY
0
Entering edit mode

OK, I'll look forward to seeing the results! I've only compared it to samtools/GATK/bcftools so far.

ADD REPLY
0
Entering edit mode

Hi Brian,

I haven't finished all of the benchmarks yet, but thought I'd give you an update.

The data set is from two C. elegans strains, WT N2 and Hawaiian CB4856 (highly polymorphic), sequenced separately. Real data, SE-49bp, 10X genomes equivalent each.

For the other variant callers, I aligned with BBMap v34.08 with flags 'sam=1.3 ambiguous=toss strictmaxindel=1000' (I can explain the rationale if interested). I tested FreeBayes, SAMtools/BCFtools, and VarScan2 with mostly default parameters (I can provide if interested). I decomposed FreeBayes MNPs into SNPs with vcflib.

For callvariants.sh, I aligned with BBMap v37.00 with flag 'ambiguous=toss', then called variants with flags 'ploidy 2 rarity=0.01' per your suggestion.

For each VCF, I filtered to keep only SNP calls, then used BEDtools intersect with a list of 103326 known Hawaiian SNPs for validation.

Here are the numbers:

FreeBayes: 102083 SAMtools/BCFtools: 100509 VarScan2: 80511 BBMap: 78022

The relative numbers are also proportional to the total number of unvalidated variants called - not surprising, since 1) the majority of Hawaiian polymorphisms are SNPs and 2) SE-49bp data is not ideal for calling other types of variants.

I'm planning to rerun on a different data set that's only 5% Hawaiian. I'll let you know how that turns out. I'm also happy to try other parameters to improve calling with BBMap (I've invested a fair amount of time tweaking the other callers to optimize results).

Take care, Harold

ADD REPLY
0
Entering edit mode

Adding GATK: 97378.

-Harold

ADD REPLY
0
Entering edit mode

Hi Harold,

To be clear, does this mean that the closer a caller's result is to 103326, the better it is? Also, does this data capture false positives?

And, of course, if the data is not confidential, I'd love to get my hands on it. I'm currently working with a human sample, NA12878, which is presumably the best-studied. But wow, it takes so long to do mapping because the genome is so big. And, calling variants is harder because the genome is more complex and repetitive. So I am working on calibrating CallVariants on human data, but the turnaround time is very slow. If you have validated variants on a small genome, and are at liberty to release the fastqs and variants, please let me know!

ADD REPLY
0
Entering edit mode

Yes, the closer to 103326 the better for true-positives. But the validated SNPs are only a subset of the true variants (~1/3 of the total), so I don't have accurate metrics on false-positives. C. elegans is really useful for testing (small 100Mb genome, complete assembly, only 5% repetitive by short-read data). I'm happy to share the data that I have (real as well as simulated reads with variants), because I really appreciate the tools that you've developed. PM me and we'll discuss further (username at gmail).

ADD REPLY

Login before adding your answer.

Traffic: 1624 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6