Question

output feature/gene name from a known base position

0

Entering edit mode

7.8 years ago

Dan • 0

Just a little background so that I can put my question into context. I have a list of SNPs and I want to find out which gene/CDS feature these SNPs occur in. For each reference bacterial genome I have gff and gbk files containing my sequence + annotations. Currently, I am using my gff files in Artemis and manually using the navigator to 'goto' my base of interest and manually recording the feature ID (more specifically the product and gene name that this base falls within). With over 200 SNPs to look at per bacterial genome this is understandably taking its time.

Therefore, my question is; is there an easier or more efficient way to do this?

Any advice on this would be very much appreciated and I would like to give my thanks to anyone who might be able to help in advance.

sequence gene SNP • 1.7k views

ADD COMMENT • link updated 7.8 years ago by Emily 24k • written 7.8 years ago by Dan • 0

1

Entering edit mode

7.8 years ago

Emily 24k

You can use the Ensembl VEP to query against our known bacterial species, or if you want to use your own annotation, you can download the script and build a cache using a FASTA and GFF file.

ADD COMMENT • link 7.8 years ago by Emily 24k

score 1 · Accepted Answer · 2017-02-08

With BEDOPS, you can convert your VCF and GFF files to sorted BED files, and then perform set operations on the BED files to find overlaps between both sets. This toolkit will help you automate doing these operations over hundreds of files.

For example, here's how to convert your SNPs and gene annotations to sorted BED files.

$ vcf2bed < snps.vcf > snps.bed
$ gff2bed < genes.gff > genes.bed

Now you can use bedmap to map genes to SNPs:

$ bedmap --echo --echo-map-id-uniq --delim '\t' snps.bed genes.bed > answer.bed

The file answer.bed contains, on each line, a SNP and the ID field of any gene annotations that overlap the position of that SNP.

There are other --echo-map-* operators, in case you want more or other details about the gene annotations. See the documentation for more detail and examples.

If you want to make this even faster and you don't want to keep around intermediate files, you can use process substitution to convert directly:

$ bedmap --echo --echo-map-id-uniq --delim '\t' <(vcf2bed < snps.vcf) <(gff2bed < genes.gff) > answer.bed

With this revision, there's need to first convert to separate files — all the intermediate data are written to Unix streams and processed directly within bedmap.

Once you do this for one pair of SNPs and gene annotations, you could think about writing a shell script to loop over all your pairs of files and do these operations. This will help deal with the N pairs of files you may need to deal with.