Hi, I have several bacterial genomes from RefSeq. I have the faa files, the assembly reports and the gpff files. How can I extract the positions of all genes from each RefSeq file?
The file you are looking for is feature_table.txt.gz, located in the same FTP directory as the genome FASTA, assembly report, GPFF, etc. For example, this is the FTP path for the Salmonella assembly: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2 and you can find the feature_table.txt.gz file in that directory. This is a tab-delimited file with information about all of the features annotated on the genome. Specifically, you can get the coordinates of every gene using something along the lines of:
zcat GCF_000006945.2_ASM694v2_feature_table.txt.gz \
| awk 'BEGIN{FS="\t";OFS="\t"}($1~/^#/ || $1=="gene"){print $7,$8,$9,$10,$15,$16,$17}'
genomic_accession start end strand symbol GeneID locus_tag
NC_003197.2 190 255 + thrL 1251519 STM0001
NC_003197.2 325 2799 + thrA 1251520 STM0002
NC_003197.2 2789 3730 + thrB 1251521 STM0003
NC_003197.2 3722 5020 + thrC 1251522 STM0004
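Note that the coordinates in this table are 1-based and inclusive. BED format, which bedtools expects, is 0-based with an exclusive end, so subtract 1 from the start position (and leave the end unchanged) if you want to use bedtools for any downstream steps.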
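If BED output is what you need, the start-coordinate adjustment can be folded into the same awk step. This is only a rough sketch: the function name `feature_table_to_bed` is made up for illustration, and it assumes the same column layout as the command above (accession in column 7, start in 8, end in 9, strand in 10, locus_tag in 17).

```shell
# Hypothetical helper: reads an uncompressed feature table on stdin and
# writes one BED6 line per "gene" row.
# Assumes the column layout used above: $7 accession, $8 start, $9 end,
# $10 strand, $17 locus_tag.
feature_table_to_bed() {
  awk 'BEGIN{FS="\t";OFS="\t"}
       # BED start is 0-based, so subtract 1; the end stays as-is.
       # The score column is not meaningful here and is set to 0.
       $1=="gene"{print $7, $8-1, $9, $17, 0, $10}'
}
```

You would then run something like `zcat GCF_000006945.2_ASM694v2_feature_table.txt.gz | feature_table_to_bed > genes.bed`, where `genes.bed` is just an example output name.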
Please provide an example of desired output.
I don't really have a desired output, I just want to know each gene's position on the genome: start position and end position.
That information is in the GTF files that you can freely download.
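As a rough sketch of how that could look, the snippet below pulls the gene rows out of a GTF stream. It relies on the standard GTF layout (sequence name in column 1, feature type in 3, start in 4, end in 5, strand in 7, attributes in 9) and assumes that each gene row carries a `gene_id "..."` attribute; check that against your own files. The function name `gtf_gene_positions` is made up for illustration.

```shell
# Hypothetical helper: reads an uncompressed GTF on stdin and prints
# sequence, start, end, strand and gene_id for every "gene" row.
gtf_gene_positions() {
  awk 'BEGIN{FS="\t";OFS="\t"}
       # Skip comment/header lines and keep only gene features.
       $0 !~ /^#/ && $3=="gene" {
         id = "."   # fallback when no gene_id attribute is found (assumption)
         # "gene_id \"" is 9 characters long; the match also includes the closing quote.
         if (match($9, /gene_id "[^"]+"/)) id = substr($9, RSTART+9, RLENGTH-10)
         print $1, $4, $5, $7, id
       }'
}
```

Assuming the GTF sits next to the other files and is gzip-compressed, usage would be along the lines of `zcat your_assembly_genomic.gtf.gz | gtf_gene_positions`, with the file name being a placeholder. GTF coordinates are 1-based and inclusive, like those in the feature table.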