Hallo friends,
I am interested in documenting a given chromosomal position for a mutation across "n" numbers of vcf files. How can i compare multiple vcf file for a given mutation in the position?
Than king you.
Hallo friends,
I am interested in documenting a given chromosomal position for a mutation across "n" numbers of vcf files. How can i compare multiple vcf file for a given mutation in the position?
Than king you.
a very fast and efficient way of getting this information is with a simple grep
. say you want to inspect position chr21:123456, then:
grep -P "^chr21\t123456\t" *vcf
you will then end up with a list of files and their lines for that particular position. note that if the variant is not present in a particular file you won't get any output from that file. you may want to check bcftools merge
if you need a more elaborated solution, since it allows you to merge the vcf files filling empty variant sites with ./.
across all files, so that you can query your position of interest being sure that no empty sites are present.
If you want to intersect multiple vcf files into one there are different tools. I usually use vcf-isec
function, inside vcftools
package (http://vcftools.sourceforge.net/perl_examples.html). Then you can just grep your exact position and see what is going on across samples. This is an example of how it works vcf-isec:
# Include positions which appear in at least two files
vcf-isec-o -n +2 A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
Don't forget to first compress and tabix all your single vcf files:
bgzip my_file.vcf tabix -p vcf my_file.vcf.gz
If you are only looking for isolated SNPs then the above suggestions should be fine. If your mutations have any complexity (indels, MNPs, or if your mutation is nearby other mutations such that the variant caller may have produced a haplotype call), then you need to use more sophisticated tools.
At the very least, you will need to run your input VCFs through a normalization procedure such as vcfallelicprimitives prior to intersection, although there are still cases that can only handled with sophisticated comparison tools such as vcfeval from RTG Tools or RTG Core. (Although vcfeval is intended for pairwise comparisons so you would have to do successive intersections)/
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thanks Jorge, I too wondered, what if there was SNP at a given position in "x" or "y" file, but then u answered the same, I am looking forward to make a database of all the "valid" mutation across N number of patients is there anything simpler I could do in my sql or R's sql package across 100's such file? I hope am not complicating the query too much!!
I don't know what you mean with "valid" mutations, but
grep
(that's my answer's section 1)Oh!! by valid, I meant sanger cross-checked