SNP at every chromosome position
1
0
Entering edit mode
9.2 years ago
waqasnayab ▴ 250

Hi,

I have SOLiD sequencing data. After aligning with SHRiMP2, I used samtools mpileup for SNP calling:

samtools mpileup -C50 -gf hg38.fa -o var.raw.bcf input.bam bcftools call -o var.raw.vcf -O v -c var.raw.bcf

The raw vcf file looks like this:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SINDHI
chr1 33953 . T . 56.8087 PASS DP=11;MQSB=0.950952;MQ0F=1;AF1=0;AC1=0;DP4=6,5,0,0;MQ=0;FQ=-59.9998 GT:PL 0/0:0
chr1 33954 . C . 56.8087 PASS DP=11;MQSB=0.950952;MQ0F=1;AF1=0;AC1=0;DP4=6,5,0,0;MQ=0;FQ=-59.9998 GT:PL 0/0:0
chr1 33955 . T . 56.4609 PASS DP=10;MQSB=0.952347;MQ0F=1;AF1=0;AC1=0;DP4=5,5,0,0;MQ=0;FQ=-56.997 GT:PL 0/0:0
chr1 35396 . C . 56.4609 PASS DP=10;MQSB=0.952347;MQ0F=1;AF1=0;AC1=0;DP4=5,5,0,0;MQ=0;FQ=-56.997 GT:PL 0/0:0
chr1 35397 . C . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0
chr1 35398 . C . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0
chr1 35399 . A . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0
chr1 35400 . A . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0
chr1 35401 . C . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0 As

You can see, the chromosome position is continuous. I know that these are raw variant file but it contains 333,399,862 variants (almost as the number of bases). So, how can I filter this (there are so many false positives), I need filtration or should I do something at mpileup or bcftools call stage?

variants mpileup samtools • 2.4k views
ADD COMMENT
2
Entering edit mode
9.2 years ago

In the way you executed samtools/bcftools, it will print out every position with a pileup, i.e., at least one mapped read. To print only variants where there is some evidence for a variant use the '-v' command line parameter:

$ bcftools call
[...]
Input/output options:
[...]
  -v, --variants-only             output variant sites only
[...]

You can also see that you have actually no variants predicted as the column for alternative sequence (ALT) is empty (".") and all mapped reads at that position support the reference sequence (e.g., INFO column DP4="6,5,0,0" does mean 6 reads on the forward and 5 reads on the reverse strand support the reference while none support the/an alternative).

Important: Still there will be a large number of false positive calls, so you will have to use some filtering procedure, e.g., removing variants with low coverage or low allele frequency. For this, have a look at bcftools filter and select suitable thresholds. I think, there is no ground-truth for suitable thresholds and people use different setting according on their needs. Probably, you find some paper doing a similar analysis or on a similar organism or similar study or whatsoever and you can adopt their setting to be comparable.

ADD COMMENT
0
Entering edit mode

Like! Its great anwer. thx

ADD REPLY

Login before adding your answer.

Traffic: 1739 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6