Question

Fetch gene names from SNP coordinates

2

Entering edit mode

6.2 years ago

Dracaena ▴ 50

Hi all,

Currently I have positions of SNPs in the format of "contig position" e.g. "2 1564". I also have an annotation file in gff format that goes along with the draft genome I am working with. Now I want to retrieve the ID of each gene that has a SNP in it. I know BEDtools does this, but it requires VCF format, which I don't have. Is there any package/software around that would fulfill my needs?

Kind regards

SNP annotation genome gene • 1.9k views

ADD COMMENT • link updated 6.1 years ago by n,n ▴ 370 • written 6.2 years ago by Dracaena ▴ 50

0

Entering edit mode

I know BEDtools does this, but it requires VCF format,

uhh ?

ADD REPLY • link 6.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Sorry, I confused it with SAMtools.

ADD REPLY • link 6.2 years ago by Dracaena ▴ 50

0

Entering edit mode

I think an example of how your SNPs file looks would be really helpful to solve this.

ADD REPLY • link 6.2 years ago by n,n ▴ 370

0

Entering edit mode

Some clarification, my SNPs file looks something like this:

1 15
1 516
1 468
3 156
4 468
7 456
7 753

So it gives the scaffold number on which the SNP is located (I am working with a draft genome) together with the base number in that scaffold where the SNP is. Along with that a have a gff3 (https://www.ensembl.org/info/website/upload/gff3.html) file which contains the start and end position of all the genes comrprised in the draft genome. I hope this makes things clearer.

Cheers

ADD REPLY • link 6.2 years ago by Dracaena ▴ 50

0

Entering edit mode

so you can convert this to BED and use bedtools , can't you ?

ADD REPLY • link 6.2 years ago by Pierre Lindenbaum 164k

score 0 · Answer 1 · 2018-10-08

I came up with a scripting solution using bash, linux commands and awk if you want to avoid using BEDtools, also added comments so that you know what is happening:

#!/bin/bash

for position in $(cat SNPs.file | cut -d' ' -f2)
do
  for scaffold in $(cat SNPs.file | cut -d' ' -f1)
  do
    cat file.gff3 | grep -v '#' | awk -F'\t' '{print $4,$5,$3,$9,$1}' | \
    grep "^${position}.*${scaffold}$"
  done
done | sort -n | uniq > output.txt

#Simple script that prints the gff3 file in the following space-delimited
#format: Start_nucleotide, End_nucleotide, feature_type, ID and scaffold number.
#After this column rearrangement it searches the gff3 file for data that you
#have on the SNPs file and prints it if they happen in pairs (scaffold number
#with its corresponding nucleotide start position). You get a new file similar
#to gff3 but only with the features that matched your SNPs.

You can throw the whole thing in a text file on the same directory you have the SNPs file and the gff3 and make it executable. However don't forget to edit the code to replace "SNPs.file" and "file.gff3" with the actual name of your files.

you can also use it directly in your terminal like this:

for position in $(cat SNPs.file | cut -d' ' -f2); do for scaffold in $(cat SNPs.file | cut -d' ' -f1); do cat file.gff3 | grep -v '#' | awk -F'\t' '{print $4,$5,$3,$9,$1}' | grep "^${position}.*${scaffold}$"; done; done | sort -n | uniq > output.txt

EDIT: found this simpler way that seems to be working well