Variant caller for low frequency variants
8.7 years ago
kulvait ▴ 270

Hi, I have amplicon sequencing data. I need a variant caller that can call low-frequency variants, since the variants may come from cancer clones that are present in the samples at unknown frequencies. My sequencing depth is 500x-5000x.

I would prefer a tool that can report every variant found, even one supported by a single read that does not match the reference sequence. I would also like to be able to adjust the filters applied to variant calls (i.e. minimum coverage, minimum number of supporting reads, minimum variant frequency in the sequenced data).

Many variant callers use heuristics whose parameters cannot be controlled.

Thanks, Vojtěch.

NGS • 8.3k views

Hi, I don't have any matched normal samples, so I can't use tools such as Strelka; I can only compare against the reference genome. Your suggestions to use samtools mpileup or bam-readcount are probably a step in the right direction. However, I would like my output to be a VCF file, and I would also like to retain MNPs, large INDELs and complex mutations. I don't know how I would infer these complex mutations from samtools mpileup or bam-readcount output. Thanks, Vojtech.


Hi, MuTect can be run on a single sample as well. Check here. It is not sensitive enough to report variants supported by a single read, but you would get a VCF as output. For complex mutations, I would suggest Lumpy or Delly. Both produce VCFs and can run on single-sample input (BAM).

8.7 years ago
Amitm ★ 2.3k

Hi, low frequency alone is probably not a safe criterion. See this link here for a very interesting discussion of picking up everything that is there vs. balancing sensitivity with specificity. That said, I have found Strelka to be super sensitive (as noted in the above link). It probably picks out every locus that has some mismatch-supporting reads. Interestingly, though, these "super-sensitive" calls often show multiple mutations in a driver gene and/or multiple mutated driver genes. I have no other evidence to refute such a scenario, but it seems biologically very improbable. MuTect, on the other hand, also picked up low-frequency variants, but it never produced the scenario that Strelka seemed to create (multiple mutations in drivers).

Finally, if you want to pick up even a single mismatch-supporting read, then parsing the pileup file is probably your best bet. It is a text format, and the 5th column can be parsed for the mismatches found. Not sure if I have answered your question. I use VarScan, though, in combination with read-depth-based filters to pick out low-frequency variants. But I don't go below 1%, as below that I found it difficult to distinguish variants from noise. The data I deal with have a median depth of 3k-20k, depending on how many loci were amplified.
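To illustrate the pileup-parsing idea, here is a minimal sketch in shell/awk. The function name pileup_mismatches is made up for this example, and it assumes the standard six-column text output of samtools mpileup (chrom, pos, ref, depth, bases, qualities). It counts only A/C/G/T calls in the 5th column as mismatches; indel strings, N calls, deletion placeholders and reference skips are skipped rather than counted, so treat it as a starting point rather than a finished caller.

```shell
# Count non-reference base calls per position in samtools mpileup text output.
# Prints: chrom, pos, ref, depth, mismatch count, mismatch fraction.
pileup_mismatches() {
  awk -F'\t' '
  {
    bases = $5; n = length(bases); mism = 0; i = 1
    while (i <= n) {
      c = substr(bases, i, 1)
      if (c == "^") { i += 2; continue }   # read-start marker plus its mapping-quality character
      if (c == "+" || c == "-") {          # indel: sign, length digits, then that many bases
        i++; len = ""
        while (i <= n && substr(bases, i, 1) ~ /[0-9]/) { len = len substr(bases, i, 1); i++ }
        i += len                           # jump over the inserted or deleted sequence
        continue
      }
      if (c ~ /[ACGTacgt]/) mism++         # a base call that differs from the reference
      i++                                  # everything else (. , $ * < > N) is not a mismatch
    }
    printf "%s\t%s\t%s\t%d\t%d\t%.4f\n", $1, $2, $3, $4, mism, ($4 > 0 ? mism / $4 : 0)
  }'
}
```

It could be used as: samtools mpileup -f ref.fa file.bam | pileup_mismatches (file names are placeholders). Skipping the character after "^" matters, because the mapping-quality character can itself look like a base or an indel sign.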

8.7 years ago

Your problem isn't detection of single base changes - that's easy. You can easily grab that info with something like bam-readcount.

The real challenge is accurately quantifying the site- and sequence-specific background mutation rate (above which a mutation must rise in order to be considered significant). Depending on how low-level you're really talking, it may even require something like error-corrected sequencing, where the input molecules are labeled with barcodes before amplification (see HaloPlex, Safe-SeqS, and about a half dozen other techniques).
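As a rough illustration of "rising above the background", one simple model (an assumption here, not something prescribed above) treats sequencing errors at a site as independent with a known per-read error rate e, so the number of error reads at depth n is Binomial(n, e). The hypothetical helper below, binom_tail, computes P(X >= k), the chance of seeing k or more variant reads from errors alone. The error rate itself still has to be estimated per site, which is the hard part.

```shell
# P(X >= k) for X ~ Binomial(n, e), summed in log space to avoid overflow.
# Assumes 0 < e < 1 and 0 <= k <= n.
binom_tail() {   # usage: binom_tail DEPTH ALT_READS ERROR_RATE
  awk -v n="$1" -v k="$2" -v e="$3" 'BEGIN {
    step = log(e) - log(1 - e)
    logpmf = n * log(1 - e)                    # log P(X = 0)
    for (i = 0; i < k; i++)                    # walk up to log P(X = k)
      logpmf += log(n - i) - log(i + 1) + step
    p = 0
    for (i = k; i <= n; i++) {                 # sum the upper tail P(X = k..n)
      if (logpmf > -700) p += exp(logpmf)      # skip terms that would underflow
      else if (i > n * e) break                # past the mode: remaining terms are negligible
      if (i < n) logpmf += log(n - i) - log(i + 1) + step
    }
    printf "%.3g\n", p
  }'
}
```

For example, binom_tail 5000 100 0.01 asks how likely 100 or more variant reads are at 5000x depth if the background error rate at that site were 1%. A small result suggests the variant rises above that assumed background; it says nothing about whether the assumed rate is right.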

8.7 years ago

You could try LoFreq (depending on exactly what your needs are). It's worked well for me with viruses, for example.

8.7 years ago

You can use FreeBayes and specify the minimum alternate frequency (-F) and count (-C) to output a VCF containing all variants that meet your criteria.
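A minimal sketch of such a call, wrapped in a hypothetical helper named freebayes_lowfreq. The -f, -F and -C options are FreeBayes flags; adding --pooled-continuous (which outputs all alleles passing the input filters regardless of the genotyping outcome) is my assumption about what fits a "report everything above the thresholds" use case, and the thresholds and file names are placeholders.

```shell
# Run FreeBayes with explicit alternate-allele thresholds.
# usage: freebayes_lowfreq REF_FASTA BAM MIN_ALT_FRACTION MIN_ALT_COUNT
freebayes_lowfreq() {
  # -F: minimum fraction of reads supporting the alternate allele
  # -C: minimum number of reads supporting the alternate allele
  freebayes -f "$1" -F "$3" -C "$4" --pooled-continuous "$2"
}
```

For example: freebayes_lowfreq ref.fa file.bam 0.01 5 > file.freebayes.vcf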

8.7 years ago
kulvait ▴ 270

Thank you for your suggestions. I have been using FreeBayes and did not obtain the results I was aiming for; however, it is still my #1 pick for detecting complex mutations. For total control over my output, though, I think I really need something like samtools mpileup plus filtering. The only problem is that it does not work well with complex mutations. So far I have this pipeline to find and process low-frequency variants, starting from the alignments in file.bam.

I need the first step because I have to fix the MD tag in my BAM files. It is not mandatory for most other uses; I need it because I did custom editing of the alignments and masking of primer sequences prior to variant calling, and I changed only the CIGAR, not the MD tag. This will fix that.

samtools calmd -rb file.bam ref.fa | sponge file.bam

The second step is the actual mpileup call.

samtools mpileup -d 99999 -t INFO/AD -uf ref.fa file.bam > file.mpi

In the third step I apply some filters, but no Bayesian or other approach to variant calling: just manual filtering for variants with coverage > 250 and frequency > 1%.

bcftools filter -i "AD[1]/DP > 0.01" file.mpi | bcftools filter -i "DP>250" | bcftools call -m -A | sed 's/,Version="3"//g' > file.vcf

The last sed is a hack needed for the next tool. I need to split the alternate alleles into separate records for further filtering, so I use a utility called vcf_parser to do exactly that.

vcf_parser file.vcf --split | sponge file.vcf

Next I can perform a second round of filtering on all alternate alleles. I can also annotate each allele separately.

bcftools filter -i "AD[1]/DP > 0.01" file.vcf | bcftools filter -i "AD[1] > 5" | sponge file.vcf

This is my current pipeline.

I would also like to merge this output with the FreeBayes output and, where FreeBayes detects complex variants, merge them into the final VCF file. That would mean that for each complex mutation found by FreeBayes I need to find all the SNPs and INDELs in the mpileup file that add up to that variant, remove them from the mpileup-based VCF, and add the aggregate variant from FreeBayes, but I don't know how to do that.

Vojtech.
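One possible sketch of such a merge, under loud assumptions: it uses plain position overlap rather than checking that the SNPs and INDELs actually add up to the complex allele, it relies on FreeBayes marking complex records with TYPE=complex in the INFO column, and the function name merge_complex is made up for this example. It keeps the header of the mpileup-based VCF only, so the INFO fields of the added FreeBayes records will not be declared there, and it sorts chromosomes lexically, which may differ from the contig order of the reference.

```shell
# Replace mpileup-based records that overlap a FreeBayes complex variant
# with that complex variant. usage: merge_complex MPILEUP_VCF FREEBAYES_VCF
merge_complex() {
  grep '^#' "$1"                         # keep the header of the mpileup-based VCF
  awk -F'\t' '
    /^#/ { next }
    fb {                                 # FreeBayes pass: remember complex records and their reference spans
      if ($8 ~ /(^|;)TYPE=[^;]*complex/) {
        n++; chr[n] = $1; s[n] = $2 + 0; e[n] = $2 + length($4) - 1; rec[n] = $0
      }
      next
    }
    {                                    # mpileup pass: drop records overlapping any complex span
      stop = $2 + length($4) - 1
      for (j = 1; j <= n; j++)
        if ($1 == chr[j] && $2 <= e[j] && stop >= s[j]) next
      print
    }
    END { for (j = 1; j <= n; j++) print rec[j] }   # add the complex records themselves
  ' fb=1 "$2" fb=0 "$1" | sort -t "$(printf '\t')" -k1,1 -k2,2n
}
```

For example: merge_complex file.vcf freebayes.vcf > merged.vcf (file names are placeholders). Records dropped this way should be reviewed, since an unrelated low-frequency SNP that happens to sit inside the span of a complex variant would be removed as well.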
