How to obtain gene-transcript isoforms in RefSeq format using a GFF file as input?
3 days ago

Hello everyone,

I'm trying to obtain gene-transcript isoform pairs from RefSeq annotation but have run into some issues. For Ensembl data I can retrieve genes and their transcript isoforms accurately with BioMart (MartView), and the output looks something like this (the genes and transcripts listed are only an example of the format; I do not want to convert IDs between Ensembl and NCBI):

Gene stable ID  Transcript stable ID
ENSG00000210049 ENST00000387314
ENSG00000211459 ENST00000389680
ENSG00000210077 ENST00000387342
ENSG00000210082 ENST00000387347
ENSG00000209082 ENST00000386347

However, when I try to convert these Ensembl IDs into RefSeq IDs, the results are often inaccurate and come back as NA in most cases, if not all (I used biomartr and rentrez, and both gave me only NAs).

Someone suggested that it might be possible to obtain the isoforms directly from a GFF file using a custom function built on the refseqR library. Here's their suggestion:

"It is possible with some coding. Functions in #refseqR take an ID as their first argument. If you obtain the IDs of the genes of your organism outside the package (e.g., from a .gff file), you could pass that vector to the function and determine the IDs/sequences of the isoforms."

Has anyone here successfully implemented a solution for this using RefSeq data, or encountered similar issues? I would appreciate any advice or coding tips for extracting the gene-transcript isoforms from RefSeq without losing accuracy. Looking at the format of the GFF file below, which feature should I extract to obtain the isoforms?

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-date 08/23/2024
#!annotation-source NCBI RefSeq GCF_000001405.40-RS_2024_08
##sequence-region NC_000001.11 1 248956422
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
NC_000001.11    RefSeq  region  1   248956422   .   +   .   ID=NC_000001.11:1..248956422;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_000001.11    BestRefSeq  pseudogene  11874   14409   .   +   .   ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1 (pseudogene);gbkey=Gene;gene=DDX11L1;gene_biotype=transcribed_pseudogene;pseudo=true
NC_000001.11    BestRefSeq  transcript  11874   14409   .   +   .   ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,GenBank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11    BestRefSeq  exon    11874   12227   .   +   .   ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,GenBank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2

Any comments are greatly appreciated! Thanks in advance for your help!

gff isoforms refseq • 245 views
The AGAT toolkit is generally the go-to tool for GTF/GFF processing. https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html should be what you need.

Edit: After reading your post again, it looks like you have Ensembl IDs but an NCBI GFF, so an additional ID translation step will be needed.

You may be able to simply download the transcripts from Ensembl: https://ftp.ensembl.org/pub/release-113/fasta/mus_musculus/cdna/

Thank you for your quick reply! I'm checking the tool now. Do you have any recommendations on which feature in the GFF file I should focus on to extract the data I need?

Ideally, I would prefer to avoid any ID conversion between Ensembl and NCBI. I would like to obtain the isoforms directly from NCBI, or in this case with the tool you mentioned. I used the Ensembl output only as an example of what I'm expecting, but with NCBI IDs.

A suggestion: in future, please formulate your question with accurate information. You posted examples of mouse IDs (which appear to be rRNAs), but your GFF example is human.

As long as your NCBI IDs are in that GFF file, you should be able to get the sequences. Otherwise, you will have to map your IDs to what is in the GFF.
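
On the question of which feature to extract: a rough sketch in Python, assuming the conventions visible in the GFF excerpt above hold across the whole file (gene-level rows carry ID=gene-..., transcript-level rows carry ID=rna-... with a Parent pointing at the gene and a transcript_id attribute). Because column 3 varies between transcript rows (mRNA, transcript, lnc_RNA and so on), the sketch follows the Parent links instead of filtering on a single feature type. The function names are made up for illustration, and rows without a transcript_id are simply skipped.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split GFF3 column 9 ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


def gene_id_from_dbxref(dbxref):
    """Return the NCBI GeneID found in a Dbxref value, or '' if there is none."""
    for xref in dbxref.split(","):
        if xref.startswith("GeneID:"):
            return xref.split(":", 1)[1]
    return ""


def gene_transcript_pairs(lines):
    """Return (gene_symbol, gene_id, transcript_id) tuples in file order."""
    genes = {}        # feature ID such as 'gene-DDX11L1' -> (symbol, GeneID)
    transcripts = []  # (parent feature ID, transcript accession)
    for line in lines:
        if line.startswith("#"):
            continue  # header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID", "")
        if feature_id.startswith("gene-"):
            # Assumption: every gene-level row (gene, pseudogene) uses this prefix.
            symbol = attrs.get("gene") or attrs.get("Name", "")
            genes[feature_id] = (symbol, gene_id_from_dbxref(attrs.get("Dbxref", "")))
        elif feature_id.startswith("rna-") and "transcript_id" in attrs:
            # Assumption: every transcript-level row uses the 'rna-' prefix.
            for parent in attrs.get("Parent", "").split(","):
                transcripts.append((parent, attrs["transcript_id"]))
    # Resolve parents after reading everything, so row order does not matter.
    return [genes[parent] + (tx,) for parent, tx in transcripts if parent in genes]


# Two rows from the excerpt in the question, shortened, with real tab separators.
sample = [
    "##gff-version 3\n",
    "\t".join(["NC_000001.11", "BestRefSeq", "pseudogene", "11874", "14409", ".", "+", ".",
               "ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;gene=DDX11L1"]) + "\n",
    "\t".join(["NC_000001.11", "BestRefSeq", "transcript", "11874", "14409", ".", "+", ".",
               "ID=rna-NR_046018.2;Parent=gene-DDX11L1;gene=DDX11L1;transcript_id=NR_046018.2"]) + "\n",
]
for symbol, gene_id, transcript_id in gene_transcript_pairs(sample):
    print(symbol, gene_id, transcript_id, sep="\t")
```

For a real annotation, an open file handle can be passed in place of the sample list, since the function only iterates over lines. A gene with several isoforms then shows up as several rows sharing the same gene symbol and GeneID, which matches the two-column layout of the BioMart example.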

You are right, I will edit the post with accurate information!

I will try the AGAT tool that you mentioned, thank you!
