vlookup function using awk for two vcf files?
3.5 years ago

Hello everyone,

I have two vcf files:

The first VCF I created myself by inserting SNP IDs related to traits of interest (from the GWAS Catalog). REF and ALT are missing for each SNP.

The second VCF I downloaded from https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/; it contains all human hg38 SNP IDs with REF and ALT information.

VCF file1:

CHROM   POS             ID      REF     ALT
1       8426071         rs302714
1       18813023        rs12063142
1       18875425        rs6695033
1       21911229        rs2445130

VCF file2:

CHROM   POS     ID              REF     ALT             
1       10128   rs796688738     A       AC      .       .       RS=796688738;RSPOS=10128;dbSNPBuildID=146;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP;TOPMED=0.99734008664627930,0.00265991335372069               
1       10128   rs1457723673    ACCCTAACCCTAACCCTAAC    A       .       .       RS=1457723673;RSPOS=10129;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP             
1       10131   rs1289482855    CT      C       .       .       RS=1289482855;RSPOS=10132;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP             
1       10132   rs1436069773    T       C       .       .       RS=1436069773;RSPOS=10132;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP;TOPMED=0.99974515800203873,0.00025484199796126              
1       10132   rs1390118706    TAACCC  T       .       .       RS=1390118706;RSPOS=10133;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP             
1       10134   rs1385251551    ACCCTAACCCTAAC  A       .       .       RS=1385251551;RSPOS=10135;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP             
1       10135   rs1303755152    CCCTAA  C       .       .       RS=1303755152;RSPOS=10136;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP             
1       10137   rs1346539515    C       G       .       .       RS=1346539515;RSPOS=10137;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP             
1       10138   rs1228214171    T       C       .       .       RS=1228214171;RSPOS=10138;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP             
1       10139   rs368469931     A       T       .       .       RS              

I want to look up every SNP ID from file 1 in file 2 and return the columns with the REF and ALT alleles.

What do you recommend for this? I have seen some options online using awk, but I have had trouble with the final output: I want just 5 columns, CHR POS ID REF ALT.

Thanks in advance, Alex

vcf awk vlookup • 1.2k views

Ok, thank you for your response!

Best, Alex

3.5 years ago
# Join the two files on the ID column (field 3); join needs both inputs sorted on that field.
join -t $'\t' -1 3 -2 3 \
    <(grep -v "^#" file1.vcf | cut -f1-5 | sort -T . -t $'\t' -k3,3) \
    <(grep -v "^#" file2.vcf | cut -f1-5 | sort -T . -t $'\t' -k3,3)
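
Since the question asks about awk and wants exactly five columns (CHR POS ID REF ALT), here is a minimal awk sketch of the same lookup, not a definitive solution. It assumes both files are tab-separated and that the ID sits in column 3. The function name `vlookup_ref_alt` and the demo records are made up for illustration. Note that CHROM and POS are taken from the reference file (file2), and that the set of wanted IDs is held in memory, which should be fine for a short GWAS list.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: for every ID in the first file that also occurs in the
# second file, print CHROM POS ID REF ALT taken from the second file.
vlookup_ref_alt() {
    local wanted="$1" reference="$2"
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                              # skip VCF header lines in either file
        NR == FNR { want[$3] = 1; next }           # first file: remember each wanted ID
        ($3 in want) { print $1, $2, $3, $4, $5 }  # second file: emit matching records
    ' "$wanted" "$reference"
}

# Tiny demo with made-up records (not real dbSNP data).
tmp=$(mktemp -d)
printf '1\t100\trsA\n1\t200\trsB\n' > "$tmp/file1.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t100\trsA\tA\tG\t.\t.\tX\n1\t150\trsC\tC\tT\t.\t.\tX\n1\t200\trsB\tT\tC\t.\t.\tX\n' > "$tmp/file2.vcf"
result=$(vlookup_ref_alt "$tmp/file1.vcf" "$tmp/file2.vcf")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Unlike join, this needs no sorting, and the output keeps the record order of file2. An ID that appears on several lines of file2 is printed once per line.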

Thanks a lot, Mr. Lindenbaum! I see that it took quite a combination of commands, so I will work through it to understand what each one does. Thanks again!

Best, Alex
