Question

Filtering SNPs from a VCF file

0

7.5 years ago

DL ▴ 50

Hi,

I want to filter SNPs from a VCF file, but I am not sure which parameters are good for SNP filtering. Several things in my VCF file confuse me. Here are some lines of the VCF output:

                            FORMAT                  INFO                                            
**#CHROM    POS ID  REF ALT QUAL    FILTER  GT  AD  DP  GQ  PL  AC  AF  AN  INFO**                              
Chr01   16434   .   T   A   32.77   .   0/1 53,6    59  61  61,0,2212   AC=1    0.500   2   BaseQRankSum=0.226  ClippingRankSum=0.000   DP=135  ExcessHet=3.0103    FS=1.850    MLEAC=1 MLEAF=0.500 MQ=32.25    MQRankSum=1.082
Chr01   103148  .   C   A   1017.77 .   0/1 25,3    55  99  1046,0,886  AC=1    0.500   2   BaseQRankSum=0.009  ClippingRankSum=0.000   DP=55   ExcessHet=3.0103    FS=0.000    MLEAC=1 MLEAF=0.500 MQ=60.20    MQRankSum=0.949
Chr01   15650   .   C   A   424.77  .   0/1 3,11    14  58  453,0,58    AC=1    0.500   2   BaseQRankSum=0.853  ClippingRankSum=0.000   DP=25   ExcessHet=3.0103    FS=0.000    MLEAC=1 MLEAF=0.500 MQ=49.38    MQRankSum=0.585 QD=30.34    ReadPosRankSum=1.479    SOR=0.760           
Chr01   15651   .   C   A   424.77  .   0/1 3,11    14  58  453,0,58    AC=1    0.500   2   BaseQRankSum=0.763  ClippingRankSum=0.000   DP=25   ExcessHet=3.0103    FS=0.000    MLEAC=1 MLEAF=0.500 MQ=49.38    MQRankSum=0.585 QD=30.34    ReadPosRankSum=1.481    SOR=0.760

In the first line of the output, AD=53,6. I take this to mean that 53 reads carry the reference allele and 6 reads carry the alternate allele. Is that right? If not, what does it mean? And if I am right, is it a good SNP?

My second question: some SNPs have a different DP in the INFO and FORMAT columns. What should I do with those? From what I have read, DP in the INFO column is the total read depth and DP in the FORMAT column is the allelic depth, so it would seem better to select SNPs on the basis of allelic depth. Please explain how I should select the SNPs.

Thanks in advance

SNP genome snp next-gen sequence • 11k views

ADD COMMENT • link updated 4.1 years ago by ashotmarg2004 ▴ 130 • written 7.5 years ago by DL ▴ 50

0

Why do you want to filter them? What is your ultimate goal? These are parameters you can use to filter the file, but not unless you're clear on what you need exactly.

ADD REPLY • link 7.5 years ago by Ram 44k

0

Thanks for the reply. I want to keep only true SNPs, but before that I want to understand the results.

ADD REPLY • link 7.5 years ago by DL ▴ 50

0

I assume you're looking for true variants and want to avoid false positives. If you're looking for polymorphisms, you might need to set some criteria based on population allele frequency and also look into phenotypic effects.

ADD REPLY • link 7.5 years ago by Ram 44k

0

As a filtering tool you can use SnpSift.
That said, first things first: as @Ram asked, why do you want to filter, and what question are you trying to answer?

There is a nice worked example of filtering decisions here (see section 8): http://userweb.eng.gla.ac.uk/cosmika.goswami/snp_calling/SNPCalling.html

ADD REPLY • link 7.5 years ago by Medhat 9.8k

0

Thank you. I have used most of the tools, but every time I have the same question: is it a true SNP or not? Can you please tell me why the DP value differs between the INFO and FORMAT columns?

Thanks

ADD REPLY • link 7.5 years ago by DL ▴ 50

1

The difference between the DP field and the AD field is:

AD and DP: allele depth and depth of coverage. These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.

AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller's filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.

DP is the filtered depth, at the sample level. It is a single count of the sample's reads at the site that passed the variant caller's filters, not a per-allele breakdown. You can check the variant caller's documentation to see which filters are applied by default. Only reads that passed those filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.

See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage).
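To make the two fields concrete, here is a minimal Python sketch that pulls AD and DP out of one sample column and compares them. It assumes a standard colon-separated FORMAT string of `GT:AD:DP:GQ:PL`; the values are the ones from the first row of the table in the question, and the function names are made up for illustration.

```python
def parse_sample(format_str, sample_str):
    """Map FORMAT keys to this sample's values (colon-separated VCF layout assumed)."""
    return dict(zip(format_str.split(":"), sample_str.split(":")))


def depth_summary(format_str, sample_str):
    """Summarise AD and DP for a biallelic site."""
    fields = parse_sample(format_str, sample_str)
    ad = [int(x) for x in fields["AD"].split(",")]  # per-allele counts: ref, alt
    dp = int(fields["DP"])
    ad_total = sum(ad)
    return {
        "ref_reads": ad[0],
        "alt_reads": ad[1],
        "ad_total": ad_total,
        "dp": dp,
        # Fraction of informative reads carrying the alternate allele.
        "alt_fraction": ad[1] / ad_total if ad_total else None,
    }


# First row of the question's table: GT=0/1, AD=53,6, DP=59, GQ=61.
summary = depth_summary("GT:AD:DP:GQ:PL", "0/1:53,6:59:61:61,0,2212")
print(summary)
```

For this row the sum of AD equals DP (59), and only about 10% of the reads carry the alternate allele, which is low for a heterozygous call in a diploid sample and is one reason to look at such a site more closely.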

ADD REPLY • link 7.5 years ago by Medhat 9.8k

0

Thank you for your informative response. I had read about this. Can you please tell me why the AD value is always smaller than the DP value in my result file? There is actually a huge difference between the two. I read that the sum of AD may differ from the individual sample depth, especially when there are many non-informative reads. Does that mean that most of the reads aligned to a given position in my data are non-informative or not properly aligned? Thanks

ADD REPLY • link 7.5 years ago by DL ▴ 50

0

Reads that are not used for calling are not counted in the DP measure, but are included in AD

ADD REPLY • link 7.5 years ago by Medhat 9.8k

0

Does that mean AD >= DP? Am I right or not? I am bothering you too much, but I want to get the concepts clear because I am new to analysing this type of data, so I apologize for that.

ADD REPLY • link 7.5 years ago by DL ▴ 50

0

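Not necessarily. AD counts reads that failed the caller's filters but leaves out uninformative reads, while DP leaves out filtered reads but counts uninformative ones, so the sum of AD can be either larger or smaller than DP. In the rows you posted it is never larger (e.g. 53+6 = 59 against DP 59, and 25+3 = 28 against DP 55).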

ADD REPLY • link 7.5 years ago by Medhat 9.8k

score 0 · Answer 1 · 2020-12-10

This topic has many "it depends" points, but from my limited experience I can point out a few very general things:

If it's e.g. human data, GATK has good best-practice pipelines that use the well-known indels and SNPs in the human genome to, e.g., recalibrate the variant quality scores. See GATK Variant Quality Score Recalibration (VQSR).

If it's non-human data (or a non-model organism for which you may not even have the correct reference sequence), you might have to filter manually, considering annotations such as QualByDepth (QD), FisherStrand (FS), StrandOddsRatio (SOR), RMSMappingQuality (MQ), MappingQualityRankSumTest (MQRankSum) and ReadPosRankSumTest (ReadPosRankSum). These are covered in the GATK "Hard-filtering germline short variants" section.
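As a rough illustration of manual filtering on these annotations, here is a small Python sketch. The thresholds are the generic starting values GATK suggests for SNPs in its hard-filtering guidance; they are a starting point, not values tuned for any particular dataset. The sketch assumes a standard semicolon-separated INFO string (the table in the question shows the fields whitespace-separated), and skipping annotations that are absent from a record is my own simplification.

```python
# Generic GATK starting thresholds for SNP hard filtering.
# Each entry returns True when the annotation value FAILS the filter.
SNP_HARD_FILTERS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "SOR": lambda v: v > 3.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def parse_info(info):
    """Turn 'DP=25;FS=0.000;...' into a dict; flag-style keys map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def failed_filters(info):
    """Names of annotations that fail; annotations missing from the record are skipped."""
    values = parse_info(info)
    return [
        name
        for name, fails in SNP_HARD_FILTERS.items()
        if name in values and fails(float(values[name]))
    ]


# INFO annotations from the first and third rows of the question's table.
row1 = "DP=135;ExcessHet=3.0103;FS=1.850;MQ=32.25;MQRankSum=1.082"
row3 = "DP=25;FS=0.000;MQ=49.38;MQRankSum=0.585;QD=30.34;ReadPosRankSum=1.479;SOR=0.760"
print(failed_filters(row1), failed_filters(row3))
```

With these thresholds the first row is flagged on mapping quality alone (MQ=32.25 is below 40), while the third row passes every check that can be evaluated.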

Some other things to consider; depending on how strict you want to be, adjust the parameters accordingly, e.g.:

min and max depth filter for each site: using cutoff values such as +/- 2x the mean depth estimate, or cutoffs based on standard deviations
allelic bias: the ratio of ref to alt allele read counts; this is already somewhat reflected in the genotype qualities, but the cutoff can be changed manually to be more conservative
filter sites with low genotype quality
remove SNPs around indels, assuming these are problematic regions, i.e. filter out sites within +/- x bp of an indel
etc.
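The depth, allelic-bias and genotype-quality checks above could be sketched roughly like this in Python. Everything numeric here is a placeholder: I read "+/- 2x the mean" as keeping sites between half and twice the mean depth, and the GQ and alt-fraction cutoffs are arbitrary illustrative values. The sketch also assumes biallelic heterozygous calls, like the ones in the question, and in practice the mean depth would come from all sites rather than four.

```python
from statistics import mean


def depth_bounds(depths, factor=2.0):
    """One reading of '+/- 2x the mean': keep sites between mean/factor and mean*factor."""
    m = mean(depths)
    return m / factor, m * factor


def site_flags(dp, ad, gq, bounds, min_gq=20, min_alt_fraction=0.25):
    """Return the reasons a heterozygous call looks suspicious (empty list = keep)."""
    low, high = bounds
    flags = []
    if dp < low:
        flags.append("low_depth")
    elif dp > high:
        flags.append("high_depth")
    if gq < min_gq:
        flags.append("low_gq")
    ref, alt = ad
    total = ref + alt
    # Only a low alt fraction is checked here; a symmetric upper bound could be added.
    if total == 0 or alt / total < min_alt_fraction:
        flags.append("allele_balance")
    return flags


# (DP, AD, GQ) from the four sample columns shown in the question.
calls = [(59, (53, 6), 61), (55, (25, 3), 99), (14, (3, 11), 58), (14, (3, 11), 58)]
bounds = depth_bounds([dp for dp, _, _ in calls])
results = [site_flags(dp, ad, gq, bounds) for dp, ad, gq in calls]
print(bounds, results)
```

With these placeholder cutoffs the first two calls are flagged for allele balance (6/59 and 3/28 alt reads) and the last two for low depth relative to the mean of this tiny example.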

bcftools, for example, has some powerful expressions for filtering.
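For the indel-window idea mentioned in the list, `bcftools filter` has a `--SnpGap` option that does this directly. If you want to see the logic spelled out, a plain Python sketch could look like the following; the indel positions and the 5 bp window are hypothetical, and treating each indel as a single position (ignoring its length) is a simplification.

```python
from bisect import bisect_left


def near_indel(pos, indel_positions, window):
    """True if pos lies within `window` bp of any position in the sorted list."""
    i = bisect_left(indel_positions, pos - window)
    return i < len(indel_positions) and indel_positions[i] <= pos + window


# Hypothetical indel start positions per chromosome, sorted ascending.
indels = {"Chr01": [15655, 20000]}

# SNP positions taken from the question's table.
snps = [("Chr01", 15650), ("Chr01", 15651), ("Chr01", 16434)]

kept = [(c, p) for c, p in snps if not near_indel(p, indels.get(c, []), window=5)]
print(kept)
```

Here the SNPs at 15650 and 15651 fall within 5 bp of the made-up indel at 15655 and are dropped, while the one at 16434 is kept.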