Hello everyone,
I’m currently trying to obtain gene-transcript isoform pairs from RefSeq but have run into some issues. I can retrieve gene and transcript isoforms accurately using BioMart-MartView for Ensembl data, and my output looks something like this (this list of genes and transcripts is just an example; I do not want to convert IDs between Ensembl and NCBI):
Gene stable ID Transcript stable ID
ENSG00000210049 ENST00000387314
ENSG00000211459 ENST00000389680
ENSG00000210077 ENST00000387342
ENSG00000210082 ENST00000387347
ENSG00000209082 ENST00000386347
However, when I try to convert these Ensembl IDs into RefSeq IDs, the results are often inaccurate and come back as NA in most cases, if not all (I used biomartR and Rentrez, which gave me only NAs).
Someone suggested that it might be possible to obtain the isoforms directly from a GFF file using a custom function built on the refseqR library. Here's their suggestion:
"It is possible with some coding. Functions in #refseqR take an ID as their first argument. If you obtain the IDs of the genes of your organism outside the package (e.g., from a .gff file), you could pass that vector to the function and determine the IDs/sequences of the isoforms."
Has anyone here successfully implemented a solution for this using RefSeq data or encountered similar issues? I would appreciate any advice or coding tips for extracting the gene-transcript isoforms from RefSeq without losing accuracy! Looking at the format of the GFF file, which feature should I extract to obtain the isoforms?
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-date 08/23/2024
#!annotation-source NCBI RefSeq GCF_000001405.40-RS_2024_08
##sequence-region NC_000001.11 1 248956422
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
NC_000001.11 RefSeq region 1 248956422 . + . ID=NC_000001.11:1..248956422;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_000001.11 BestRefSeq pseudogene 11874 14409 . + . ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1 (pseudogene);gbkey=Gene;gene=DDX11L1;gene_biotype=transcribed_pseudogene;pseudo=true
NC_000001.11 BestRefSeq transcript 11874 14409 . + . ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,GenBank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11 BestRefSeq exon 11874 12227 . + . ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,GenBank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
Any comments are greatly appreciated! Thanks in advance for your help!
The AGAT toolkit is generally the go-to tool for GTF/GFF processing. https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html should be what you need.
Edit: After reading your post again, it looks like you have Ensembl IDs but an NCBI GFF, so an additional ID translation step will be needed.
You may be able to simply download the transcripts from Ensembl: https://ftp.ensembl.org/pub/release-113/fasta/mus_musculus/cdna/
Thank you for your quick reply! I'm checking the tool. Do you have any recommendations for which feature I need to focus on in the GFF file to extract the data that I need?
Ideally, I would prefer to avoid any ID conversion between Ensembl and NCBI. I would like to obtain the isoforms directly from NCBI, in this case using the tool that you mentioned. I used the Ensembl output only as an example of what I'm expecting, but with NCBI IDs.
A suggestion: in future, please formulate your question with accurate information. You posted examples of mouse IDs (which appear to be rRNAs), but your GFF example is human.
As long as your NCBI IDs are in that GFF file, you should be able to get the sequence. Otherwise you will have to map your IDs to what is in the GFF.
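On the question of which feature to focus on: in the GFF excerpt from the original post, the transcript row carries Parent=gene-DDX11L1 and transcript_id=NR_046018.2, while the exon row points to the transcript (Parent=rna-NR_046018.2) rather than to the gene. So one possible approach is to keep every row whose Parent is a gene-level feature and read its transcript_id. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes a tab-separated NCBI RefSeq GFF3 in which gene-level IDs start with "gene-" (as in the excerpt); the function names `parse_attributes` and `gene_transcript_pairs` are made up for illustration.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Turn the GFF3 column-9 string (key=value;key=value) into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


def gene_transcript_pairs(lines):
    """Return (gene symbol, transcript accession) pairs from GFF3 lines.

    Assumption: gene-level features (gene, pseudogene, ...) have IDs that
    start with "gene-", which is what the NCBI excerpt in this thread shows.
    """
    genes = {}      # gene feature ID -> gene symbol
    children = []   # (Parent ID, transcript accession)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature row
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID", "")
        if feature_id.startswith("gene-"):
            genes[feature_id] = attrs.get("gene", attrs.get("Name", feature_id))
        elif "transcript_id" in attrs and "Parent" in attrs:
            # Parent may list several IDs separated by commas
            for parent in attrs["Parent"].split(","):
                children.append((parent, attrs["transcript_id"]))
    # Keep only rows whose parent is a gene; exon/CDS rows point to rna-* IDs
    return [(genes[parent], tx) for parent, tx in children if parent in genes]
```

To use it on a downloaded annotation, open the uncompressed GFF file and pass the file handle to `gene_transcript_pairs`. Resolving parents only after the whole file has been read means the result does not depend on genes appearing before their transcripts.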
You are right, I will edit the post with accurate information!
I will try the AGAT tool that you mentioned, thank you!