5.9 years ago
jaafari.omid
Dear all, I have a VCF file from whole-genome data on a fish species. This VCF file was prepared with the bcftools pipeline. I have read some material on filtering VCF files, but I couldn't make heads or tails of it. I know there is a GATK pipeline, but because I used bowtie2 for mapping I don't think I can use it. So I would be very grateful if you could help me filter my VCF file with bcftools, vcftools, or any other tool.
Thanks in advance for any help.
Regards, Omid
Hello jaafari.omid,
what's the goal of your filtering? With bcftools view you can create subsets based on nearly any criteria you like.
fin swimmer
Hi, actually I was looking for a straightforward pipeline for filtering. Should I just filter on minimum genotype depth and quality? Or is it better to consider other types of filtering as well, like var filtering, or removing indels and copy number variants?
Sorry to say, but there is no "straightforward pipeline for filtering".
It depends on the library prep, the sequencing platform, the way you do alignment/mapping and variant calling and, of course, what you are trying to find out.
Without more information about this, no one can give you a definitive answer.
It sounds a bit like you are asking how to remove false positive variants from your file. But here it is important to know what matters more to you: specificity or sensitivity?
fin swimmer
So first of all I should consider those parameters. But I thought I could at least remove individuals above a certain level of missing data, and also apply minimum genotype depth and quality thresholds. For mapping I used bowtie2 with the --very-sensitive option, and here is the command I used for variant calling.
Actually, I am looking for SNPs that are outliers between different groups. Of course removing false positive variants is a good idea, but I still don't understand the difference between specificity and sensitivity.
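The per-individual missing-data idea mentioned above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the function name `sample_missingness`, the toy VCF lines, the sample names and the 50% cutoff are all made up here, and a genotype is counted as missing whenever its GT field contains a `.`.

```python
def sample_missingness(vcf_lines):
    """Return {sample: fraction of sites whose genotype is missing}."""
    samples, missing, n_sites = [], [], 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            missing = [0] * len(samples)
            continue
        n_sites += 1
        gt_idx = fields[8].split(":").index("GT")
        for i, entry in enumerate(fields[9:]):
            gt = entry.split(":")[gt_idx]
            if "." in gt:  # e.g. "./." or "."
                missing[i] += 1
    if n_sites == 0:
        return {}
    return {s: m / n_sites for s, m in zip(samples, missing)}


# Hypothetical toy data: two samples, three sites.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tfish1\tfish2",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t./.:0",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1/1:15\t./.:0",
    "chr1\t300\t.\tG\tA\t40\tPASS\t.\tGT:DP\t0/0:9\t0/1:11",
]

miss = sample_missingness(toy_vcf)
# Illustrative cutoff: keep samples with at most 50% missing genotypes.
keep = [s for s, frac in miss.items() if frac <= 0.5]
print(miss, keep)
```

The resulting list of sample names could then be passed to whatever tool does the actual subsetting; the right cutoff depends on the data set.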
I would upload the .vcf files to Galaxy and filter them there with bcftools. For example, in Galaxy the default for DP is 10, and you can change that.
Thanks for your answer. Can I then also filter on MAF using Galaxy?
Sorry, I am not sure; I only started working with whole-genome sequencing 2 weeks ago. But I came across an R tool named MAFtools that looks very helpful, although I have not used it yet.
Many many thanks for your comments.
You are most welcome, best of luck
Do you mean the MAF in your own dataset or the MAFs from the large consortia, like 1000 Genomes?
I meant my own data set: filtering my VCF file based on MAF.
Cool. For that, I recommend bcftools view, with the following option:
For the other filtering that you mentioned earlier, if you have BCFtools, you should also have a Perl executable called vcfutils.pl, which has much extra functionality on top of mpileup and call:
There are two types of filtering: (1) quality filtering and (2) filtering for variants of interest.
Unfortunately, not all variants in your VCF file are true variants. Errors introduced during library prep, sequencing and alignment lead to false positive variants. If you have a lot of variants, it is useful to first try to eliminate those false positives before looking for variants of interest.
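As a rough illustration of that order (quality filtering first, then selecting variants of interest, here by minor allele frequency), a minimal Python sketch follows. Everything in it is an assumption made for the example: the function names, the toy sites, and the thresholds `min_qual=30` and `min_maf=0.1`. It handles only biallelic sites and is not a replacement for bcftools.

```python
def minor_allele_freq(genotypes):
    """MAF at a biallelic site from GT strings such as '0/1'.

    Missing alleles ('.') are skipped; returns None if nothing was called.
    """
    alleles = []
    for gt in genotypes:
        alleles += [a for a in gt.replace("|", "/").split("/") if a != "."]
    if not alleles:
        return None
    alt_freq = sum(a != "0" for a in alleles) / len(alleles)
    return min(alt_freq, 1 - alt_freq)


def filter_sites(vcf_lines, min_qual=30.0, min_maf=0.1):
    """Return (chrom, pos, maf) for sites passing both filters."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        f = line.rstrip("\n").split("\t")
        # Step 1: quality filtering on the site-level QUAL column.
        if f[5] == "." or float(f[5]) < min_qual:
            continue
        # Step 2: variants of interest, here defined by MAF.
        gt_idx = f[8].split(":").index("GT")
        gts = [sample.split(":")[gt_idx] for sample in f[9:]]
        maf = minor_allele_freq(gts)
        if maf is not None and maf >= min_maf:
            kept.append((f[0], int(f[1]), maf))
    return kept


# Hypothetical toy sites with four samples each.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\ts4"
toy_sites = [
    header,
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t0/0\t0/0",  # kept
    "chr1\t200\t.\tC\tT\t10\tPASS\t.\tGT\t0/1\t0/1\t0/1\t0/1",  # low QUAL
    "chr1\t300\t.\tG\tA\t60\tPASS\t.\tGT\t1/1\t1/1\t1/1\t1/1",  # MAF is 0
    "chr1\t400\t.\tT\tC\t45\tPASS\t.\tGT\t0/1\t1/1\t./.\t0/0",  # kept
]

kept = filter_sites(toy_sites)
print(kept)
```

Doing the quality step first means the MAF is only evaluated on sites that were already judged trustworthy, which matches the order suggested above.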
Sensitivity and specificity are terms that describe how reliable your dataset is. Sensitivity answers the question: how many of the true variants in my sample am I able to detect? Specificity answers the question: how many other variants, besides the true ones, will I detect?
Usually the sensitivity of an NGS analysis is quite high, which means you will detect most of the true variants. Because of the errors mentioned above, the specificity may not be as high, since you will have a certain number of false positives.
The goal of quality filtering is to increase specificity by removing those false positives. Depending on the filter criteria, you will also remove some true positives, which decreases sensitivity. So whenever you do quality filtering, you have to ask yourself what is more important: being sure to have all true variants, or having a clean dataset in which you can be sure all variants are true but some are missing.
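The trade-off described above can be made concrete with a small Python sketch. It assumes a known truth set, which in practice is rarely available, and it measures the "clean dataset" side as precision (the fraction of called variants that are true) rather than textbook specificity, because the latter needs a count of true negatives. The sites, QUAL values and thresholds are invented for illustration.

```python
def sensitivity_and_precision(called, truth):
    """Compare a set of called sites against a set of true sites."""
    tp = len(called & truth)   # true variants that were called
    fp = len(called - truth)   # called sites that are not true variants
    fn = len(truth - called)   # true variants that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


def apply_threshold(calls, min_qual):
    """Keep only the sites whose QUAL is at least min_qual."""
    return {site for site, qual in calls.items() if qual >= min_qual}


# Hypothetical truth set and calls (site -> QUAL).
truth = {("chr1", 100), ("chr1", 200), ("chr1", 300), ("chr1", 400)}
calls = {
    ("chr1", 100): 60,
    ("chr1", 200): 45,
    ("chr1", 300): 25,  # true variant with low QUAL
    ("chr1", 400): 55,
    ("chr1", 500): 20,  # false positive
    ("chr1", 600): 35,  # false positive
}

loose = sensitivity_and_precision(apply_threshold(calls, 0), truth)
strict = sensitivity_and_precision(apply_threshold(calls, 40), truth)
print("no filter:  sensitivity=%.2f precision=%.2f" % loose)
print("QUAL >= 40: sensitivity=%.2f precision=%.2f" % strict)
```

In this toy case the stricter threshold removes both false positives but also drops one true variant, so the dataset gets cleaner while sensitivity falls.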
Thank you very much for your very helpful explanation. So I think specificity is more important to me, and I will try to keep the final file clean.
Regards,