I am new to Linux and programming and am trying to use vcftools. I have 3 vcf files; each one is a different population (i.e. with no shared individuals between the files). I am trying to use vcf-isec to merge the 3 files and end up with one vcf file that contains only the SNPs that are present in all 3 files. I have tried the following code:
vcf-isec -n =3 file1.vcf.gz file2.vcf.gz file3.vcf.gz -f -c > CombinedPops.vcf
and without -c:
vcf-isec -n =3 file1.vcf.gz file2.vcf.gz file3.vcf.gz -f > CombinedPops.vcf
but I keep ending up with one file containing only the individuals from the first input file. It also gives me a warning that "the number of sample columns is different", but I read in another post that -f forces vcf-isec to output the file regardless. Could this warning be why I can't get a file with ALL the individuals listed? Can vcf-isec even do this?
Although I have read the vcf-isec documentation, I am still not sure exactly what the difference between the -c and -o options is, which may be part of my problem.
Any help is greatly appreciated!
Hi venu, thanks for your help. The vcf-isec code you wrote is basically what I did, except that I specified exactly 3 files rather than 3 or more, and an uncompressed output instead. However, this gave me a file containing only the individuals from the first file (although it did give me only the loci found in all three files). I have looked at vcf-merge but I think this produces a file with ALL loci? I only want the ones common to all the specified files. I don't have indels, my data is simple SNP data, but I will look further at vcf-annotate. The -o flag is mentioned here: http://vcftools.sourceforge.net/perl_module.html#vcf-isec, under "Read More".
My bad. I edited. So you need SNPs shared by all 3 files (same chr#, position, base change, etc.), but not in the way vcf-isec does it?
Yep, exactly.
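For anyone who wants to check which sites really are shared before merging, here is a minimal Python sketch. It is not part of vcftools; the function names `site_keys` and `shared_sites` are made up for illustration, and it assumes "shared" means an identical CHROM, POS, REF and ALT in every file (multi-allelic or differently normalised records would need extra handling).

```python
import gzip


def site_keys(path):
    """Collect (CHROM, POS, REF, ALT) for every record in one VCF file.

    Assumption: plain-text or gzip/bgzip-compressed VCF with tab-separated columns.
    """
    opener = gzip.open if path.endswith(".gz") else open
    keys = set()
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
            keys.add((chrom, pos, ref, alt))
    return keys


def shared_sites(paths):
    """Return the sites present in every one of the given VCF files."""
    key_sets = [site_keys(path) for path in paths]
    return set.intersection(*key_sets)
```

Calling `shared_sites(["file1.vcf.gz", "file2.vcf.gz", "file3.vcf.gz"])` would return the set of sites common to all three populations, which could then be used to subset a merged file.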
Hi weedy23, did you finally manage to solve the issue? I am having the same problem as you...
Hi, sorry I only just saw your comment. I ended up using vcf-merge instead. However, it included ALL loci in the output file, not just loci present in all the files. So I had to go through the output file and delete all the loci that were missing for one or more of the populations. Not ideal but it didn't take too long in the end if you sort the file. Good luck!
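The manual clean-up step described above could be scripted roughly like this. It is only a sketch under some assumptions: the merged file marks a locus that was absent from one input with missing genotypes (`.` or `./.`) for all of that file's samples, GT is the first key in the FORMAT column, and you supply the sample IDs for each population yourself. The names `is_missing` and `filter_merged` are hypothetical, not vcftools functions.

```python
def is_missing(sample_field):
    """True if the GT part of a sample column has no called allele."""
    genotype = sample_field.split(":", 1)[0]
    alleles = genotype.replace("|", "/").split("/")
    return all(allele == "." for allele in alleles)


def filter_merged(lines, populations):
    """Yield header lines plus records called in every population.

    lines: iterable of lines from a merged VCF.
    populations: dict mapping a population name to its list of sample IDs
                 (an assumption: you know which samples came from which file).
    """
    columns = {}
    for line in lines:
        if line.startswith("##"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            # Map each population to the column indexes of its samples.
            index = {name: i for i, name in enumerate(fields)}
            columns = {
                pop: [index[sample] for sample in samples]
                for pop, samples in populations.items()
            }
            yield line
            continue
        # Keep the record only if each population has at least one called genotype.
        if all(
            any(not is_missing(fields[i]) for i in cols)
            for cols in columns.values()
        ):
            yield line
```

Note that this also drops a locus that was present in a file but had every genotype missing in that population, which may or may not be what you want.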
How do I extract specific variants for A? Is the following command correct?