varscan misses alternate variant alleles
2
0
Entering edit mode
8.0 years ago
mark.rose ▴ 50

Hi All

When using varscan (v2.4.0) to call variants on Illumina sequenced amplicons, I have a sample that contains reads where there is a mix of an insertion and a deletion at the same position. The problem is varscan only detects the deletion.

java -jar VarScan.v2.4.0.jar mpileup2cns my_sample.mpileup --strand-filter 0 --min-var-freq 0.10 --min-avg-qual 0 --min-coverage 1 --min-reads2 1 --output-vcf 1 --variants 1 > my_sample.vcf

here is the relevant position in the mpileup input to varscan

ZmpylD  177     t       193     .-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.+23CGTCGACGTACCCGACGGCAACA.+23CGTCGACGTCCCCG
ACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.-2CG.-2CG..-2CG.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACAA.+23CGTCGACGTCCCCGACGGCAACA.-2C
G.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.+23CGTCGACG
TCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG
.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.+23CG
TCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.+23CGTCG
ACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCCACA.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG..-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.+23TGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.+23CGTCTACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACAA.-2CGA.-2CG.-2CG.-2CG.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.+23CGTCGACGTCCCCGACGGCAACA..+23CGTCGACGTCCCCGACGGCAACA.-2CG.-2CG.-2CG     AFGCGC5GF+G@GFFFEF=FGFG<#FFF6GFGFEGCCCGCGCGFFGCECGFGGGCGCFEFECFEFGGGFGEFCFGCGGFGGFCGFG@EG8EG?ECGEGGCFGCEG7G=F=CFCFCG@CGF>FFG?FCEGFEGCFG2ACEG?GE;GGFFGG?<GGGC=F=FGCFF:DECC8CGGGGF%C#GCGFC<=GCG6GGD

igv screenshot

So the question is why is the insertion not also reported and can this be fixed? Note that I have also tried to use freebayes on this data but get a similar result. It is important that I be able to detect such alternate variant alleles. If varscan or freebayes cannot do this, can anyone suggest a tool that can?

Thanks for your help

Mark

variant insertion deletion mpileup freebayes • 3.5k views
ADD COMMENT
0
Entering edit mode
8.0 years ago
Tonor ▴ 480

I think the command you are using: mpileup2cns - calls the consensus of your sample - in which case only one thing will be reported (the consensus - the deletion -2CG is observed more often).

If you try mpileup2indel on your sample - does that call both the insertion and deletion at your site?

ADD COMMENT
0
Entering edit mode

Thanks for the suggestion but , no, it is as before. Incidentally, I mistakenly said that varscan called the deletion not the insertion. Actually varscan calls the insertion not the deletion and freebayes calls the deletion not the insertion.

ADD REPLY
0
Entering edit mode

Can you post the output for both mpileup2cns and mpileup2indel for the position

ADD REPLY
0
Entering edit mode

From the original output, it looks like the 2bp deletion is observed 134 times, and a 23bp insertion 53 times. So the 2bp deletion seems the correct consensus (so freebayes is correct). Does varscan call a 23bp insertion - or a 21bp insertion - i.e. maybe it merges the insertion and deletion together as they both start with CG.

ADD REPLY
0
Entering edit mode

the mpileupcns vcf

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
ZmpylD  167     .       G       C       .       PASS    ADP=193;WT=0;HET=1;HOM=0;NC=0   GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/1:200:193:193:134:59:30.57%:9.0798E-21:32:35:134:0:59:0
ZmpylD  168     .       A       T       .       PASS    ADP=193;WT=0;HET=1;HOM=0;NC=0   GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/1:200:193:193:134:59:30.57%:9.0798E-21:33:35:134:0:59:0
ZmpylD  172     .       C       A       .       PASS    ADP=193;WT=0;HET=1;HOM=0;NC=0   GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/1:200:193:193:134:59:30.57%:9.0798E-21:34:33:134:0:59:0
ZmpylD  177     .       TCG     T       .       PASS    ADP=193;WT=0;HET=1;HOM=0;NC=0   GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/1:255:193:193:3:134:69.43%:3.0618E-76:37:35:3:0:134:0

the mpileupindel vcf

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
ZmpylD  177     .       TCG     T       .       PASS    ADP=193;WT=0;HET=1;HOM=0;NC=0   GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR  0/1:255:193:193:3:134:69.43%:3.0618E-76:37:35:3:0:134:0

So in neither case is the 23 bp (or 21bp) insertion called. For what its worth, IGV recognizes it as a 23bp deletion.

ADD REPLY
0
Entering edit mode

So if you are correct that varscan is calling only the most common variant at each locus, whether you use mpileupcns or mpileupindel, then it is not useful as I have tried to use it since multiple alleles per locus will be common in my data. Likewise for freebayes. If these tools are inadequate (as I have used them) is there another that could do the job? Alternatively, could I take a different approach with varscan? Could I perhaps use it to call variants on every read individually so that multiple alternate alleles would not occur and then tally up the vcf info for each to find variants above my thresholds?

ADD REPLY
0
Entering edit mode
8.0 years ago
Tonor ▴ 480

OK, you said you made a mistake and that VarScan was actually calling the insertion, but the mpileup2cns is showing that VarScan is correctly calling the 2bp deletion (like freebayes is) - the reference was TCG, but your sample just has a T (hence 2bp CG deletion) - so nothing seems wrong there - that is the consensus as the 2bp deletion is the most abundant at that site. If you want info on the alternate alleles at that position (i.e. the 23bp insertion), then calling the consensus is not the way to go - calling variants would be the way to go.

I thought mpileup2indel would call multiple variants at a position - is there no other entry for position 177 in the mpileup2indel output? each indel for a position would be on separate lines. Can you post the lines above and below 177 in the mpileup2indel file?

A good variant calling tool to try is lofreq: http://csb5.github.io/lofreq/

ADD COMMENT
0
Entering edit mode

If there is only one line for position 177, then VarScan mpileup2indel is only reporting the most abundant indel at that position.

If you are interested in all the variants at a position, then lofreq is probably worth a try.

ADD REPLY
0
Entering edit mode

there is only the one

ADD REPLY
0
Entering edit mode

Tried this as you suggested

$ ~/BioInfo/bin/lofreq_star-2.1.2/bin/lofreq call --no-baq --call-indels -f my_ref.fa my.bam
##fileformat=VCFv4.0
##fileDate=20161122
##source=/SD5/people/rosema1/BioInfo/bin/lofreq_star-2.1.2/bin/lofreq call --no-baq --call-indels -f my_ref.fa my.bam
##reference=.fa
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth">
##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency">
##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=CONSVAR,Number=0,Type=Flag,Description="Indicates that the variant is a consensus variant (as opposed to a low frequency variant).">
##INFO=<ID=HRUN,Number=1,Type=Integer,Description="Homopolymer length to the right of report indel position">
##FILTER=<ID=min_dp_10,Description="Minimum Coverage 10">
##FILTER=<ID=sb_fdr,Description="Strand-Bias Multiple Testing Correction: fdr corr. pvalue > 0.001000">
##FILTER=<ID=min_snvqual_44,Description="Minimum SNV Quality (Phred) 44">
##FILTER=<ID=min_indelqual_30,Description="Minimum Indel Quality (Phred) 30">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
my_ref  167     .       G       C       1261    PASS    DP=193;AF=0.305699;SB=0;DP4=134,0,59,0
mY-ref  168     .       A       T       1315    PASS    DP=193;AF=0.305699;SB=0;DP4=134,0,59,0
my_ref  172     .       C       A       1324    PASS    DP=193;AF=0.305699;SB=0;DP4=134,0,59,0
Number of substitution tests performed: 264
Number of indel tests performed: 9

Now perhaps I have not hit upon the right configuration, I will try again tomorrow, but, so far, this doesn't even call the indels. I continue to wonder about my earlier suggestion of calling variants on individual reads with varscan then compiling the results. The nature of my question is such that I would only have to do this on 1-2K reads. What do you think?

ADD REPLY
0
Entering edit mode

Did you run indelqual first?

If you also want to call indels add --call-indels to the parameter list. Note, this requires that your BAM file contains indel qualities (automatically added by GATK’s BQSR or with lofreq indelqual)

ADD REPLY
0
Entering edit mode

Yes, preprocessing the bam file with lofreq indelqual did the trick. Thanks a lot.

Still toying with the idea of calling variants on an individual read level though, then setting thresholds on frequency to eliminate sequencing error. The main advantage of doing this, now that the multi-allelic variant problem is solved, is that it will provide phasing information. I originally wanted to use freebayes for this project because of its proclaimed phasing capabilities but soon found out it is flawed and misses haplotypes it shouldn't. I provided test data to freebayes' author and he confirmed the problem but has not yet found a solution. I have also tried "whatshap" (another haplotyping tool) but thus far it has fallen short as well. Examining individual reads seems like it might provide this since what I am sequencing are amplicons covered end to end with 2 stitched 2X250 Illumina PE reads.

ADD REPLY
0
Entering edit mode

Should have included the lofreq vcf

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
ZmpylD  167     .       G       C       1261    PASS    DP=193;AF=0.305699;SB=0;DP4=134,0,59,0
ZmpylD  168     .       A       T       1315    PASS    DP=193;AF=0.305699;SB=0;DP4=134,0,59,0
ZmpylD  172     .       C       A       1324    PASS    DP=193;AF=0.305699;SB=0;DP4=134,0,59,0
ZmpylD  177     .       T       TCGTCGACGTCCCCGACGGCAACA        1517    PASS    DP=193;AF=0.253886;SB=0;DP4=140,0,49,0;INDEL;HRUN=1
ZmpylD  177     .       TCG     T       4178    PASS    DP=193;AF=0.694301;SB=0;DP4=59,0,134,0;INDEL;HRUN=1
ADD REPLY

Login before adding your answer.

Traffic: 1916 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6