_1 in gene_ids of human T2T assembly gtf file
2.5 years ago

Hi all,

Any idea why the gene_ids in NCBI's gtf file for the T2T human genome assembly have "_1" at the end?

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz

NC_060925.1     BestRefSeq      gene    52979   54612   .       -      .       gene_id "LOC101928626_1"; transcript_id ""; db_xref "GeneID:101928626"; description "uncharacterized LOC101928626"; gbkey "Gene"; gene "LOC101928626"; gene_biotype "lncRNA"; 
NC_060925.1     BestRefSeq      transcript      52979   54612   .       -       .      gene_id "LOC101928626_1"; transcript_id "NR_125957.1"; db_xref "GeneID:101928626"; exception "annotated by transcript or proteomic data"; gbkey "ncRNA"; gene "LOC101928626"; inference "similar to RNA sequence (same species):RefSeq:NR_125957.1"; note "The RefSeq transcript has 2 substitutions, 1 non-frameshifting indel compared to this genomic sequence"; product "uncharacterized LOC101928626"; transcript_biotype "lnc_RNA"; 
NC_060925.1     BestRefSeq      exon    54522   54612   .       -       .       gene_id "LOC101928626_1"; transcript_id "NR_125957.1"; db_xref "GeneID:101928626"; exception "annotated by transcript or proteomic data"; gene "LOC101928626"; inference "similar to RNA sequence (same species):RefSeq:NR_125957.1"; note "The RefSeq transcript has 2 substitutions, 1 non-frameshifting indel compared to this genomic sequence"; product "uncharacterized LOC101928626"; transcript_biotype "lnc_RNA"; exon_number "1"; 
NC_060925.1     BestRefSeq      gene    111940  112877  .       -      .       gene_id "OR4F29_1"; transcript_id ""; db_xref "GeneID:729759"; db_xref "HGNC:HGNC:31275"; description "olfactory receptor family 4 subfamily F member 29"; gbkey "Gene"; gene "OR4F29"; gene_biotype "protein_coding"; gene_synonym "OR7-21"; 
NC_060925.1     BestRefSeq   transcript      111940  112877  .       -       .       gene_id "OR4F29_1"; transcript_id "NM_001005221.2"; db_xref "GeneID:729759"; exception "annotated by transcript or proteomic data"; gbkey "mRNA"; gene "OR4F29"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001005221.2"; note "The RefSeq transcript has 9 substitutions, 1 frameshift compared to this genomic sequence"; product "olfactory receptor family 4 subfamily F member 29"; tag "RefSeq Select"; transcript_biotype "mRNA";

This breaks some GO enrichment/GSEA analyses. Is it safe to just strip these "_1" suffixes?

cheers

gtf T2T • 1.2k views
2.5 years ago
vkkodali_ncbi ★ 3.8k

Thank you for bringing this up! The _# suffix is a counter added to ensure uniqueness, but an unanticipated side effect of how our pipeline processed the data caused the counter to be applied excessively, and we’re working on a fix. Unfortunately, it’s not as simple as dropping all _# suffixes, because some genes are intentionally annotated in multiple locations (e.g. chrX & Y in the PAR region). Dropping only the _1 suffixes is largely OK. The most reliable approach is to leave the GTF file as-is and either rely on the gene=ABCD attribute or drop the _# suffix in a post-processing step before doing the GO enrichment/GSEA analysis.

UPDATE (07-13-2022): The files on the FTP are now fixed.
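
For anyone still working with a copy of the file downloaded before the fix, the post-processing step could look roughly like the sketch below. It is only an illustration, not NCBI's own tooling: the function names (`parse_attributes`, `strip_one_suffix`, `gene_id_to_symbol`) are made up for this example, and it assumes the real file is tab-separated with the attributes in the ninth column, written as `key "value";` pairs as in the lines quoted in the question. It prefers the `gene` attribute and falls back to stripping a trailing `_1` only, leaving `_2`, `_3`, etc. untouched.

```python
import re

# Matches one GTF attribute of the form: key "value";
_ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attr_field):
    """Return the first value seen for each key in a GTF attribute column."""
    attrs = {}
    for key, value in _ATTR_RE.findall(attr_field):
        # Keys such as db_xref can repeat; keep only the first occurrence.
        attrs.setdefault(key, value)
    return attrs


def strip_one_suffix(gene_id):
    """Drop a trailing '_1' only; other counters (_2, _3, ...) are kept."""
    return gene_id[:-2] if gene_id.endswith("_1") else gene_id


def gene_id_to_symbol(lines):
    """Map each gene_id on 'gene' rows to a symbol usable for GO/GSEA."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = parse_attributes(fields[8])
        gene_id = attrs.get("gene_id", "")
        # Prefer the 'gene' attribute; fall back to stripping '_1'.
        mapping[gene_id] = attrs.get("gene") or strip_one_suffix(gene_id)
    return mapping
```

To run it over the compressed file, something like `gene_id_to_symbol(gzip.open(path, "rt"))` should work, where `path` is wherever you saved the GTF. The resulting dictionary can then be used to rename the gene identifiers in a count matrix before the enrichment step.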


awesome, thanks!
