Hi WouterDeCoster,
Given: two different files
1. 1000G bi-allelic SNPs
2. My samples' bi-allelic SNPs
Problem:
1. Collect only the variants that intersect between the two files (i.e. output only the sites that are common to both).
Result: two files: (1) the intersecting variants from 1000G, with the 1000G genotype information; (2) the same variants from my samples, with my own samples' genotypes.
2. Combine those shared variants into a single VCF file, with the same sites and the union of all samples.
In short: first the common variants, then the union of all samples.
Thanks!
I would solve problem one by generating identifiers for your variants (preferably starting from the smaller file) by concatenating chromosome, position and alternative allele. You can then use those identifiers to filter the second file.
e.g.:
#Get the identifiers present in yourfile.vcf (header lines are dropped so only IDs are written)
bcftools annotate --set-id '%CHROM\_%POS\_%ALT' yourfile.vcf | grep -v '^#' | cut -f3 > MyIdentifiers.txt
#Give the same type of identifiers to the 1000G data vcf
bcftools annotate --set-id '%CHROM\_%POS\_%ALT' 1000Gdata.vcf > 1000Gdata_withidentifiers.vcf
#Filter the 1000G data to only contain the variants you have in your vcf
java -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant 1000Gdata_withidentifiers.vcf -o 1000G_myvariants.vcf -IDs MyIdentifiers.txt
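If you want to sanity-check the result without bcftools or GATK, the same identifier idea can be sketched in plain awk. This is only a rough sketch under a few assumptions: both VCFs are uncompressed text, the first file is not empty, and ALT is in column 5 as in a standard VCF. The function name `filter_by_key` is made up for this example.

```shell
# Hypothetical helper: print the header of the second VCF plus only those
# records whose CHROM_POS_ALT key also occurs in the first VCF.
# Usage: filter_by_key yourfile.vcf 1000Gdata.vcf > 1000G_myvariants.vcf
filter_by_key() {
  awk -F'\t' '
    # First file: remember the key of every non-header record.
    NR == FNR { if ($0 !~ /^#/) keys[$1 "_" $2 "_" $5] = 1; next }
    # Second file: pass header lines through unchanged.
    /^#/ { print; next }
    # Second file: keep a record only if its key was seen in the first file.
    ($1 "_" $2 "_" $5) in keys
  ' "$1" "$2"
}
```

Swapping the two arguments should give the matching subset of your own file, which would be the second output file asked for in problem one.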
It's unclear to me what is common between both files. Are the same variants in both files or the same samples?