Question

Which SNPs fall within genes and what's their description?

0

Entering edit mode

8.1 years ago

paulotyama • 0

I have a bed file with 13k snps and want to intersect this with a gff file with all the genes annotated for this species. My aim is to find which of my snps fall within genes or other annotated features. Does any one have any suggestions on how to go about this? I'd really appreciate some suggestions and pointers.

Here's my gff file;

head Aradu.Araip_v02.gff | column -t

Aradu.A01      gene  17735860  17739171  ID=Aradu.B2QWP;Name=Aradu.B2QWP;Note=uncharacterized    protein  LOC100797259  isoform  X4  [Glycine  max]%3B  IPR004332  (Transposase%2C  MuDR%2C  plant)
Aradu.A01      mRNA  17735860  17739171  ID=Aradu.B2QWP.1;Parent=Aradu.B2QWP;Name=Aradu.B2QWP.1
Aradu.A01      exon  17735860  17736516  ID=Aradu.B2QWP:exon:0;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17737010  17737038  ID=Aradu.B2QWP:exon:1;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17737236  17737802  ID=Aradu.B2QWP:exon:2;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17738326  17738425  ID=Aradu.B2QWP:exon:3;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17738558  17738633  ID=Aradu.B2QWP:exon:4;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17738752  17738962  ID=Aradu.B2QWP:exon:5;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17739071  17739171  ID=Aradu.B2QWP:exon:6;Parent=Aradu.B2QWP.1

Here's my bed file with snps;

head Allsnps.bed | column -t

chrom      pos     pos     Marker        conversion.type
Aradu.A01  230496  230496  AX-147207636  Polyhigh
Aradu.A01  230616  230616  AX-147207637  Other
Aradu.A01  231463  231463  AX-147207638  Polyhigh
Aradu.A01  253683  253683  AX-147207640  Other
Aradu.A01  390660  390660  AX-147207661  Nominor
Aradu.A01  405426  405426  AX-147207663  Polyhigh
Aradu.A01  413869  413869  AX-147207665  Polyhigh
Aradu.A01  461328  461328  AX-147207676  Nominor
Aradu.A01  785293  785293  AX-147207749  Nominor

Thanks,
Paul

bed gff snp gene • 2.6k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 8.1 years ago by paulotyama • 0

1

Entering edit mode

8.1 years ago

DG 7.3k

You can do this with BEDTools on the command line, or you can also write a small python script with PyBEDTools, which interfaces with BEDTools to accomplish this very easily.

For instance given a BED file of exons and a BED file of SNPS, identifying the SNPs in exons is quite easy:

import pybedtools
exons = pybedtools.example_bedtool('a.bed')
snps = pybedtools.example_bedtool('b.bed')
exons_with_snps = exons.intersect(snps, u=True)

They have plenty of examples in the documentation of working with GFF files, and filtering those files using specific criteria.

ADD COMMENT • link 8.1 years ago by DG 7.3k

1

Entering edit mode

@Dan, is there any advantage to running PyBEDTools vs. using BEDTools directly? (not being snarky, merely showing my ignorance)

ADD REPLY • link 8.1 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

The only advantage for me is repeatability. If I'm doing something more than just a straight intersection or one line bedtools command its nice to have a script on hand so I'm always doing it the exact same way everytime I do it. There are also some advantages where the sheer amount of lines of code you need to type can be lower, or more clear and straightforward.

ADD REPLY • link 8.1 years ago by DG 7.3k

score 2 · Accepted Answer · 2017-02-16

2

Entering edit mode

8.1 years ago

Floris Brenk ★ 1.0k

http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

intersectBed -a file1 -b file2 -wb

ADD COMMENT • link 8.1 years ago by Floris Brenk ★ 1.0k

score 2 · Accepted Answer · 2017-02-17

BEDTools are great! Use it myself a lot. At the same time, it is interesting to learn new ways of doing things if you plan to have a career in bioinformatics. Here is a way to same data intersection with awk:

awk 'NR==FNR{print $1, $3,"e_start",$0;print $1, $4,"e_end",$0;}NR!=FNR{print $1, $3,"SNP", $0}' Aradu.Araip_v02.gff Allsnps.bed | \
sort -k1,1 -k2,2n -k7,7nr | \
awk '$3=="e_start"{in_level=1;tmp=$0;sub(/([^ ]+ +){3}/,"",tmp);line[in_level]=tmp; \
  while(in_level!=0){getline; \
    if($3=="e_start"){in_level+=1;tmp=$0;sub(/([^ ]+ +){3}/,"",tmp);line[in_level]=tmp}; \
    if($3=="e_end"){in_level-=1}; \
    if($3=="SNP"){tmp=$0;sub(/([^ ]+ +){3}/,"",tmp); \
      for(i=1;i<=in_level;i++){print tmp,line[i]} \
    } \
  } \
}'

awk reads line by line. NR==FNR is true only for the first file. We add extra rows for each feature in GFF so one line is for start and another line is for the end of the feature. That way we can sort using Unix sort. -k2,2n sorts numerically by second column only. -k7,7nr sorts numerically in reverse order only by the seventh column. As a result, you have a structure that has information about starts and ends of GFF features as well as all SNP from the second file and everything is already sorted by chromosome and position. Now we start another awk on that structure to print only information within features' boundaries together with information about features themselves. We hold information about how many levels we are within the structure because we have features inside features inside features and want to print out all the information. We have technical information that we added in the first two columns and these can be removed with tmp=$0;sub(/([^ ]+ +){3}/,"",tmp); This is a good way to print out all columns after a given one (third column in this case).

While these can be much more complicated than installing and running bedtools, it gives you some experience with Unix, sort and awk that might be useful in the future. Also this gives you way more opportunities to play with data.