Extract only locations of annotated genes from a reference genome
1
1
Entering edit mode
2.9 years ago
Ivan ▴ 60

I have a reference genome ref_genome.fna I got by downloading via datasets download genome via its accession number. GenBank assembly and RefSeq assembly are identical. When I use RefSeq accession number to find it in RefSeq database, I can see features that were annotated, and those are Gene; mRNA; CDS; ncRNA. When I go to it's annotation release, link here, I can see that 25,293 genes are protein coding. Likewise, annotation products are available on ftp site, link here.

What I want to do is extract the locations of those 25,293 protein coding genes (alongside any gene identifier) in a single file, eg.

chrom   chromStart   chromEnd   geneID
chr2       100000      155000   RK031
...         ...        ...      ...

For that purpose, what file do I need to download, and what tool do I need to use?

annotation RefSeq coding-genes • 1.1k views
ADD COMMENT
1
Entering edit mode

You can simply download the transcript and protein sequences for Astyanax mexicanus from NCBI.

Otherwise you can use AGAT (Extracting genomic feature sequences from GTF/GFF files with AGAT ) for extracting this information from genome file.

ADD REPLY
0
Entering edit mode
2.9 years ago
MirianT_NCBI ▴ 760

Hi Ivan,
To create a table like you described in your post, you can use grep and awk to extract the fields from the feature table. Here's how I did it:

grep "protein_coding" GCF_000372685.2_Astyanax_mexicanus-2.0_feature_table.txt | awk -F '\t' '{print substr($5,0,5)$6,$8,$9,$16}' > GCF_000372685.2_Astyanax_mexicanus_protein_coding.txt

I used grep to extract all lines containing "protein_coding"; I piped the output to awk, and extracted the fields 5 (seq_type, and only the first five characters, so you have chrom), 6 (chromosome number), 8 (start), 9 (end) and 16 (GeneID). The fields 5 and 6 are concatenated, so you have only one column with the chromosome info. Those entries without chromosome info (unplaced scaffolds) will appear as unpla. I got only 25,117 protein coding from the feature table. Not sure about the discrepancy though.

I hope it helps :)

ADD COMMENT

Login before adding your answer.

Traffic: 2197 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6