Hello everyone,
I have two VCF files:
The first VCF I created myself by inserting SNP IDs related to traits of interest (from the GWAS Catalog). The REF and ALT alleles are missing for every SNP.
The second VCF I downloaded from https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/ and it contains all human hg38 SNP IDs with REF and ALT information.
VCF file1:
CHROM POS ID REF ALT
1 8426071 rs302714
1 18813023 rs12063142
1 18875425 rs6695033
1 21911229 rs2445130
VCF file2:
CHROM POS ID REF ALT
1 10128 rs796688738 A AC . . RS=796688738;RSPOS=10128;dbSNPBuildID=146;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP;TOPMED=0.99734008664627930,0.00265991335372069
1 10128 rs1457723673 ACCCTAACCCTAACCCTAAC A . . RS=1457723673;RSPOS=10129;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
1 10131 rs1289482855 CT C . . RS=1289482855;RSPOS=10132;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
1 10132 rs1436069773 T C . . RS=1436069773;RSPOS=10132;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP;TOPMED=0.99974515800203873,0.00025484199796126
1 10132 rs1390118706 TAACCC T . . RS=1390118706;RSPOS=10133;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
1 10134 rs1385251551 ACCCTAACCCTAAC A . . RS=1385251551;RSPOS=10135;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
1 10135 rs1303755152 CCCTAA C . . RS=1303755152;RSPOS=10136;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
1 10137 rs1346539515 C G . . RS=1346539515;RSPOS=10137;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP
1 10138 rs1228214171 T C . . RS=1228214171;RSPOS=10138;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP
1 10139 rs368469931 A T . . RS
I want to look up every SNP ID from file 1 in file 2 and return the matching REF and ALT alleles.
What do you recommend for this? I have seen some options online using awk, but I have had trouble with the final output: I want just 5 columns: CHR POS ID REF ALT
Thanks in advance, Alex
You can use tsv-join for this. The documentation is here: https://github.com/eBay/tsv-utils/blob/master/docs/tool_reference/tsv-join.md
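If you would rather script the lookup, the same join by ID can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the ID is the third whitespace-separated column in both files, the dbSNP file may be gzip-compressed, and CHROM and POS are taken from the dbSNP record rather than from file 1. The function names (`open_text`, `load_ids`, `lookup_alleles`, `write_table`) are made up for illustration.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def load_ids(path):
    """Collect the ID column (3rd field) from the small, hand-made file 1."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank/short lines, '#' header lines and a plain
            # 'CHROM POS ID REF ALT' header row.
            if len(fields) < 3 or line.startswith("#") or fields[0] == "CHROM":
                continue
            ids.add(fields[2])
    return ids


def lookup_alleles(ids, dbsnp_path):
    """Yield (CHROM, POS, ID, REF, ALT) for dbSNP records whose ID is in `ids`."""
    with open_text(dbsnp_path) as handle:
        for line in handle:  # stream the large file line by line
            if line.startswith("#"):
                continue
            # Split off only the first five columns; the rest (QUAL, FILTER,
            # INFO) stays in the unused sixth element.
            fields = line.split(None, 5)
            if len(fields) >= 5 and fields[2] in ids:
                yield tuple(fields[:5])


def write_table(rows, out_path):
    """Write the matched records as a 5-column, tab-separated table."""
    with open(out_path, "w") as out:
        out.write("CHROM\tPOS\tID\tREF\tALT\n")
        for row in rows:
            out.write("\t".join(row) + "\n")
```

With hypothetical file names, calling `write_table(lookup_alleles(load_ids("file1.vcf"), "file2.vcf.gz"), "file1_with_alleles.tsv")` would write one line per matching dbSNP record. Only the IDs from file 1 are held in memory, so the large dbSNP file is read once and never loaded whole. IDs from file 1 that are absent from file 2 are silently dropped, so it is worth comparing the number of output lines with the number of input IDs.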
Ok, thank you for your response!
Best, Alex