Question

Refseq sequences gene position extraction

4.7 years ago

er.doug.ragnar ▴ 30

Hi, I have several bacterial genomes from RefSeq. I have the faa files, the assembly reports and the gpff files. How can I extract the positions of all genes from each RefSeq file?

Refseq gene position • 1.4k views

ADD COMMENT • link updated 4.7 years ago by vkkodali_ncbi ★ 3.8k • written 4.7 years ago by er.doug.ragnar ▴ 30

Please provide an example of desired output.

ADD REPLY • link 4.7 years ago by ATpoint 85k

I don't really have a desired output. I just want to know each gene's position on the genome: start position and end position.

ADD REPLY • link 4.7 years ago by er.doug.ragnar ▴ 30

That information is in the GTF files that you can freely download.

ADD REPLY • link 4.7 years ago by ATpoint 85k

score 1 · Answer 1 · 2020-03-04

The file you are looking for is feature_table.txt.gz, located in the same FTP directory as the genome FASTA, assembly report, GPFF, etc. For example, this is the FTP path for the Salmonella assembly: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2 and you can find the feature_table.txt.gz file in that directory. It is a tab-delimited file with information about all of the features annotated on the genome. Specifically, you can get the coordinates of all genes using something along the lines of:

zcat GCF_000006945.2_ASM694v2_feature_table.txt.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($1~/^#/ || $1=="gene"){print $7,$8,$9,$10,$15,$16,$17}'

genomic_accession  start  end    strand  symbol  GeneID   locus_tag
NC_003197.2        190    255    +       thrL    1251519  STM0001
NC_003197.2        325    2799   +       thrA    1251520  STM0002
NC_003197.2        2789   3730   +       thrB    1251521  STM0003
NC_003197.2        3722   5020   +       thrC    1251522  STM0004

Note that the coordinates in this table are 1-based and inclusive. BED format uses 0-based starts and exclusive ends, so subtract 1 from the start position (and leave the end unchanged) if you want to use bedtools for any downstream steps.
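
The start-coordinate shift mentioned above could be folded into the same kind of awk pass. The following is a minimal sketch rather than a tested pipeline. It assumes the column layout used in the command above ($7 genomic_accession, $8 start, $9 end, $10 strand, $17 locus_tag). The function name feature_table_to_bed and the output name genes.bed are hypothetical, and the BED score column is filled with a placeholder 0.

```shell
# Sketch: convert "gene" rows of an NCBI feature table (read from stdin)
# into 6-column BED: chrom, start, end, name, score, strand.
# feature_table_to_bed is a hypothetical helper name, not an NCBI tool.
feature_table_to_bed() {
  awk 'BEGIN { FS = "\t"; OFS = "\t" }
       # Keep only gene rows; the "#" header line and CDS/RNA rows are skipped.
       $1 == "gene" {
         # The table is 1-based inclusive; BED is 0-based with an exclusive end,
         # so only the start is shifted.
         print $7, $8 - 1, $9, $17, 0, $10
       }'
}

# Example usage (assumes the file from the answer above is in the current directory):
# zcat GCF_000006945.2_ASM694v2_feature_table.txt.gz | feature_table_to_bed > genes.bed
```

With the thrL row shown above, this would give a BED interval of 189 to 255 on NC_003197.2. Using the locus_tag as the name is one choice among several; the symbol ($15) would work too, but it can be empty for genes without an assigned symbol.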