Using bedtools intersect with custom txt files
1
0
Entering edit mode
17 months ago
Meisam ▴ 250

Hi

I guess I have a rather simple problem but I couldn't find a solution in previous answers. From the "COVID19hg" GWAS consortium, I am trying to extract all SNPs that intersect with the coordination of a list of genes I'm interested in. I understand it's normally done with tools like bedtools intersect but the issue is that the GWAS results are not reported in a standard vcf format, and they only allow to download the file in a text format similar to vcf but with no headers and many additional columns. See the head of one example file:

(base)$ head COVID19_HGI_B1_ALL_leave_23andme_20210607.b37.txt 
#CHR    POS REF ALT SNP all_meta_N  all_inv_var_meta_beta   all_inv_var_meta_sebeta all_inv_var_meta_p  all_inv_var_meta_cases  all_inv_var_meta_controls   all_inv_var_meta_effective  all_inv_var_het_p   all_meta_AF rsid    anew_chr    anew_pos    REF ALT
1   693731  A   G   1:693731:A:G    12  -2.5351e-02 3.3686e-02  4.5172e-01  7077    51575   5827    4.1912e-01  1.249e-01   rs12238997  1   758351  A   G
1   714596  T   C   1:714596:T:C    7   -3.3903e-02 7.0748e-02  6.3179e-01  5743    41335   4676    8.8309e-01  3.347e-02   rs149887893 1   779216  T   C
1   715265  C   T   1:715265:C:T    8   -4.1206e-02 6.4139e-02  5.2059e-01  6213    41615   4851    9.0075e-01  3.687e-02   rs12184267  1   779885  C   T
1   715367  A   G   1:715367:A:G    8   -3.1029e-02 6.4021e-02  6.2791e-01  6213    41615   4851    8.7083e-01  3.7e-02 rs12184277  1   779987  A   G
1   717485  C   A   1:717485:C:A    8   -3.705e-02  6.4596e-02  5.6626e-01  6213    41615   4851    8.8184e-01  3.664e-02   rs12184279  1   782105  C   A
1   717587  G   A   1:717587:G:A    5   -3.2116e-02 1.0398e-01  7.5743e-01  5588    39337   4555    9.8982e-02  1.464e-02   rs144155419 1   782207  G   A
1   720381  G   T   1:720381:G:T    8   -3.5517e-02 6.3729e-02  5.7732e-01  6213    41615   4851    8.9083e-01  3.732e-02   rs116801199 1   785001  G   T
1   721290  G   C   1:721290:G:C    8   -3.5073e-02 6.3631e-02  5.8151e-01  6213    41615   4851    8.6413e-01  3.747e-02   rs12565286  1   785910  G   C
1   723891  G   C   1:723891:G:C    7   1.6154e-02  5.4698e-02  7.6774e-01  5666    42317   4650    7.5031e-01  9.479e-01   rs2977670   1   788511  G   C

Now I am trying to intersect this SNPs report file with a GTF annotation file that I have already filtered for my genes of interest. However bedtools does not recognise the SNP report file as a valid entry. I tried to make the file look like a vcf format to be compatible but the trick did not work:

(base)$ bedtools intersect -a COVID19_HGI_B1_ALL_leave_23andme_20210607.b37.vcf -b shortlistGTF.gtf -wa -wb 
Error: unable to open file or unable to determine types for file COVID19_HGI_B1_ALL_leave_23andme_20210607.b37.vcf

- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the 
  expected columns (e.g., cols 2 and 3 for BED).

Is there any way to input custom txt files (other than BED/VCF/GTF formats) in bedtools and provide argument which columns/fields need to be used for the intersection? I'm quite open to options other than bedtools as well, PICARD, etc. Thank you very much!

bedtools GWAS intersect SNP • 1.1k views
ADD COMMENT
2
Entering edit mode
17 months ago
Asaf 10k

The minimal BED format should have the chromosome, starting position and end position. You can also have the name on the fourth column. Using awk (or even excel for the matter) you can simply duplicate the second column (and add 1)

awk '{print $1,$2,$2+1,$15}' input.txt > output.bed

And then you can use bedtools intersect

ADD COMMENT
0
Entering edit mode

Thank you Asaf! it worked perfectly, just a minor point I needed to change the field separator to tab otherwise bedtools was giving me the same error:

awk '{print $1"\t"$2"\t"$2+1"\t"$15}' input.txt > output.bed
ADD REPLY
1
Entering edit mode

yes, you're correct. You could also do it by setting OFS (Output Field Separator) :

awk  'BEGIN {OFS="\t"} {print $1,$2,$2+1,$15}' input.txt
ADD REPLY

Login before adding your answer.

Traffic: 1893 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6