Refseq sequences gene position extraction
4.7 years ago

Hi, I have several bacterial genomes from RefSeq. For each one I have the .faa file, the assembly report, and the GPFF file. How can I extract the positions of all genes from each RefSeq assembly?

Refseq gene position • 1.4k views

Please provide an example of desired output.

I don't really have a desired output format; I just want to know each gene's position on the genome, i.e. its start and end positions.

That information is in the GTF files that you can freely download.

4.7 years ago
vkkodali_ncbi ★ 3.8k

The file you are looking for is feature_table.txt.gz, located in the same FTP directory as the genome FASTA, assembly report, GPFF, etc. For example, this is the FTP path for the Salmonella assembly: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2 and you can find the feature_table.txt.gz file in that directory. It is a tab-delimited file with information about all of the features annotated on the genome. Specifically, you can get the coordinates of all genes using something along the lines of:

zcat GCF_000006945.2_ASM694v2_feature_table.txt.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($1~/^#/ || $1=="gene"){print $7,$8,$9,$10,$15,$16,$17}'

genomic_accession  start  end    strand  symbol  GeneID   locus_tag
NC_003197.2        190    255    +       thrL    1251519  STM0001
NC_003197.2        325    2799   +       thrA    1251520  STM0002
NC_003197.2        2789   3730   +       thrB    1251521  STM0003
NC_003197.2        3722   5020   +       thrC    1251522  STM0004

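Note that the coordinates in this table are 1-based and inclusive. If you want to use bedtools for any downstream steps, subtract 1 from the start position (and leave the end position unchanged) to get the 0-based, half-open coordinates that BED expects.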
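
The conversion from the 1-based feature table to BED could be sketched roughly as follows. This is only a minimal sketch, assuming the same feature_table.txt column order as in the command above; the function name feature_table_to_bed and the output file name genes.bed are hypothetical, and using the locus_tag (falling back to the symbol) as the BED name with a placeholder score of 0 is just one possible choice.

```shell
# Hypothetical helper: read feature_table.txt rows on stdin, write BED6 on stdout.
# Assumes the feature_table.txt column order used above:
# $7 genomic_accession, $8 start, $9 end, $10 strand, $15 symbol, $17 locus_tag.
feature_table_to_bed() {
  awk 'BEGIN{FS="\t";OFS="\t"}
       $1=="gene"{
         name = ($17 != "" ? $17 : $15)    # prefer locus_tag, fall back to symbol
         print $7, $8-1, $9, name, 0, $10  # start becomes 0-based, end is unchanged
       }'
}

# Usage (file name taken from the example above):
#   zcat GCF_000006945.2_ASM694v2_feature_table.txt.gz | feature_table_to_bed > genes.bed
```

The header line is skipped here because BED files have no column header row.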
