Hi
I guess I have a rather simple problem but I couldn't find a solution in previous answers. From the "COVID19hg" GWAS consortium, I am trying to extract all SNPs that intersect with the coordination of a list of genes I'm interested in. I understand it's normally done with tools like bedtools intersect
but the issue is that the GWAS results are not reported in a standard vcf format, and they only allow to download the file in a text format similar to vcf but with no headers and many additional columns. See the head of one example file:
(base)$ head COVID19_HGI_B1_ALL_leave_23andme_20210607.b37.txt
#CHR POS REF ALT SNP all_meta_N all_inv_var_meta_beta all_inv_var_meta_sebeta all_inv_var_meta_p all_inv_var_meta_cases all_inv_var_meta_controls all_inv_var_meta_effective all_inv_var_het_p all_meta_AF rsid anew_chr anew_pos REF ALT
1 693731 A G 1:693731:A:G 12 -2.5351e-02 3.3686e-02 4.5172e-01 7077 51575 5827 4.1912e-01 1.249e-01 rs12238997 1 758351 A G
1 714596 T C 1:714596:T:C 7 -3.3903e-02 7.0748e-02 6.3179e-01 5743 41335 4676 8.8309e-01 3.347e-02 rs149887893 1 779216 T C
1 715265 C T 1:715265:C:T 8 -4.1206e-02 6.4139e-02 5.2059e-01 6213 41615 4851 9.0075e-01 3.687e-02 rs12184267 1 779885 C T
1 715367 A G 1:715367:A:G 8 -3.1029e-02 6.4021e-02 6.2791e-01 6213 41615 4851 8.7083e-01 3.7e-02 rs12184277 1 779987 A G
1 717485 C A 1:717485:C:A 8 -3.705e-02 6.4596e-02 5.6626e-01 6213 41615 4851 8.8184e-01 3.664e-02 rs12184279 1 782105 C A
1 717587 G A 1:717587:G:A 5 -3.2116e-02 1.0398e-01 7.5743e-01 5588 39337 4555 9.8982e-02 1.464e-02 rs144155419 1 782207 G A
1 720381 G T 1:720381:G:T 8 -3.5517e-02 6.3729e-02 5.7732e-01 6213 41615 4851 8.9083e-01 3.732e-02 rs116801199 1 785001 G T
1 721290 G C 1:721290:G:C 8 -3.5073e-02 6.3631e-02 5.8151e-01 6213 41615 4851 8.6413e-01 3.747e-02 rs12565286 1 785910 G C
1 723891 G C 1:723891:G:C 7 1.6154e-02 5.4698e-02 7.6774e-01 5666 42317 4650 7.5031e-01 9.479e-01 rs2977670 1 788511 G C
Now I am trying to intersect this SNPs report file with a GTF annotation file that I have already filtered for my genes of interest. However bedtools does not recognise the SNP report file as a valid entry. I tried to make the file look like a vcf format to be compatible but the trick did not work:
(base)$ bedtools intersect -a COVID19_HGI_B1_ALL_leave_23andme_20210607.b37.vcf -b shortlistGTF.gtf -wa -wb
Error: unable to open file or unable to determine types for file COVID19_HGI_B1_ALL_leave_23andme_20210607.b37.vcf
- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the
expected columns (e.g., cols 2 and 3 for BED).
Is there any way to input custom txt files (other than BED/VCF/GTF formats) in bedtools and provide argument which columns/fields need to be used for the intersection? I'm quite open to options other than bedtools as well, PICARD, etc. Thank you very much!
Thank you Asaf! it worked perfectly, just a minor point I needed to change the field separator to
tab
otherwise bedtools was giving me the same error:yes, you're correct. You could also do it by setting
OFS
(Output Field Separator) :