So I have a vcf file (myvar.vcf) and I want to do the following:-
1. Remove all mutations tagged as INDELS.
2. Consider mutaions that have a read depth greater than 100
3.To get all the mutations that occur in the tumor genome with respect to the control genome, we select all the mutations that have a genotype of 0/0 for the control genome and have a genotype of 1/1 or 0/1 for the tumor genome.
I have been trying all the three for quite a long time, but with no result. So can anyone help me with the steps? Thanks.
"read depth greater than 100" in both samples? What if one has read depth 100 and the other 99? Total read depth? What if a mutation appears somatic as your tumour has read depth 100 and your normal 0?
i have 12 samples in all. So read depth > 100 in all of them. I would filter out these sites because I dont want to trust
mutations at sites with very high coverage, because they might represent variation between variable copy number repeats, i.e., the reads that map to this location in the reference are actually from duplicated sites in the sample.
Your genotyping filter assumes diploid. This condition is frequently violated in tumours. When you also consider tumour purity and sub-clonality, variant allele frequencies of 0.33/0.66 are perfectly normal even for diploid tumours.
What program generated the VCF? A somatic variant caller (which you should be using if you are doing somatic variant analysis) will typically use the SOMATIC flag to identify tumour-only mutations.
Have you looked at the VCF headers in your file? The R VariantAnnotation package will allow you to perform each of the filters you want to do each in 1 line of code.
Have you tried doing it with SNPeff/SNPsift? The manual is well written and should guide you to do that.
Thanks. Wil try that
"read depth greater than 100" in both samples? What if one has read depth 100 and the other 99? Total read depth? What if a mutation appears somatic as your tumour has read depth 100 and your normal 0?
i have 12 samples in all. So read depth > 100 in all of them. I would filter out these sites because I dont want to trust mutations at sites with very high coverage, because they might represent variation between variable copy number repeats, i.e., the reads that map to this location in the reference are actually from duplicated sites in the sample.
Your genotyping filter assumes diploid. This condition is frequently violated in tumours. When you also consider tumour purity and sub-clonality, variant allele frequencies of 0.33/0.66 are perfectly normal even for diploid tumours.