Fetch gene names from SNP coordinates
1
2
Entering edit mode
6.1 years ago
Dracaena ▴ 50

Hi all,

Currently I have positions of SNPs in the format of "contig position" e.g. "2 1564". I also have an annotation file in gff format that goes along with the draft genome I am working with. Now I want to retrieve the ID of each gene that has a SNP in it. I know BEDtools does this, but it requires VCF format, which I don't have. Is there any package/software around that would fulfill my needs?

Kind regards

SNP annotation genome gene • 1.9k views
ADD COMMENT
0
Entering edit mode

I know BEDtools does this, but it requires VCF format,

uhh ?

ADD REPLY
0
Entering edit mode

Sorry, I confused it with SAMtools.

ADD REPLY
0
Entering edit mode

I think an example of how your SNPs file looks would be really helpful to solve this.

ADD REPLY
0
Entering edit mode

Some clarification, my SNPs file looks something like this:

1 15
1 516
1 468
3 156
4 468
7 456
7 753

So it gives the scaffold number on which the SNP is located (I am working with a draft genome) together with the base number in that scaffold where the SNP is. Along with that a have a gff3 (https://www.ensembl.org/info/website/upload/gff3.html) file which contains the start and end position of all the genes comrprised in the draft genome. I hope this makes things clearer.

Cheers

ADD REPLY
0
Entering edit mode

so you can convert this to BED and use bedtools , can't you ?

ADD REPLY
0
Entering edit mode
6.1 years ago
n,n ▴ 370

I came up with a scripting solution using bash, linux commands and awk if you want to avoid using BEDtools, also added comments so that you know what is happening:

#!/bin/bash

for position in $(cat SNPs.file | cut -d' ' -f2)
do
  for scaffold in $(cat SNPs.file | cut -d' ' -f1)
  do
    cat file.gff3 | grep -v '#' | awk -F'\t' '{print $4,$5,$3,$9,$1}' | \
    grep "^${position}.*${scaffold}$"
  done
done | sort -n | uniq > output.txt

#Simple script that prints the gff3 file in the following space-delimited
#format: Start_nucleotide, End_nucleotide, feature_type, ID and scaffold number.
#After this column rearrangement it searches the gff3 file for data that you
#have on the SNPs file and prints it if they happen in pairs (scaffold number
#with its corresponding nucleotide start position). You get a new file similar
#to gff3 but only with the features that matched your SNPs.

You can throw the whole thing in a text file on the same directory you have the SNPs file and the gff3 and make it executable. However don't forget to edit the code to replace "SNPs.file" and "file.gff3" with the actual name of your files.

you can also use it directly in your terminal like this:

for position in $(cat SNPs.file | cut -d' ' -f2); do for scaffold in $(cat SNPs.file | cut -d' ' -f1); do cat file.gff3 | grep -v '#' | awk -F'\t' '{print $4,$5,$3,$9,$1}' | grep "^${position}.*${scaffold}$"; done; done | sort -n | uniq > output.txt

EDIT: found this simpler way that seems to be working well

ADD COMMENT
0
Entering edit mode

Hello,

why don't you just convert your SNP file to bed and use bedtools intersect as Pierre suggested?

$ awk -v OFS="\t" '{print $1, $2-1, $2}' SNPs.file > SNPs.bed
$ bedtools intersect -a file.gff3 -b SNPs.bed -wa > output.txt

fin swimmer

ADD REPLY

Login before adding your answer.

Traffic: 2196 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6