Question

StringTie Error: no valid ID found for GFF record

4

3.8 years ago

1234gingko ▴ 50

Hi, I successfully aligned and analyzed my RNA-Seq data using HISAT2, StringTie and DESeq2 with the La_Amiga3_1 genome (white lupin) from NCBI to map transcripts. Beginner's luck. Now I am trying to do the exact same thing using the CNRS_Lalb genome (also white lupin on NCBI), and when I get to the first StringTie step, I get "Error: no valid ID found for GFF record". I have looked at both genome GTF files, and the first field (chromosome ID) looks fine in each (cut -f 1 *.gtf | sort | uniq); the chromosomes are named differently, but otherwise look fine. I don't think that is the problem, and I am looking for more hints as to what this error means. I did read the StringTie manual but need more help. Thanks very much, K

RNA-Seq • 9.2k views

ADD COMMENT • link updated 2.7 years ago by Juke34 8.9k • written 3.8 years ago by 1234gingko ▴ 50

1

Omg, thanks so much. This enabled me to find a prior post, "Ensembl GTF format: isn't the tag "transcript_id" mandatory?", in which Ensembl explains the evolution of the GTF format and suggests exactly what you suggest:
"I would recommend removing the gene lines from the gtf file". This gets me back on track so fast, I appreciate it! - Karen

ADD REPLY • link 3.8 years ago by 1234gingko ▴ 50

0

Can you please post a couple of lines of the GTF file?

ADD REPLY • link 3.8 years ago by i.sudbery 20k

0

sure, thanks:

head -50 CN*/*.gtf
#gtf-version 2.2
#!genome-build CNRS_Lalb_1.0
#!genome-build-accession NCBI_Assembly:GCA_009771035.1
WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; 
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; 
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    1131    1283    .   -   .   gene_id "Lalb_Chr00c40g0409291"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409291"; 
WOCE01000065.1  Genbank exon    1131    1283    .   -   .   gene_id "Lalb_Chr00c40g0409291"; transcript_id "Lalb_Chr00c40g0409291"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409291"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    1698    1816    .   -   .   gene_id "Lalb_Chr00c40g0409301"; transcript_id ""; gbkey "Gene"; gene_biotype "rRNA"; locus_tag "Lalb_Chr00c40g0409301"; note "5s_rRNA"; 
WOCE01000065.1  Genbank exon    1698    1816    .   -   .   gene_id "Lalb_Chr00c40g0409301"; transcript_id "Lalb_Chr00c40g0409301"; gbkey "rRNA"; locus_tag "Lalb_Chr00c40g0409301"; product "5S ribosomal RNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2001    2152    .   -   .   gene_id "Lalb_Chr00c40g0409311"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409311"; 
WOCE01000065.1  Genbank exon    2001    2152    .   -   .   gene_id "Lalb_Chr00c40g0409311"; transcript_id "Lalb_Chr00c40g0409311"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409311"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2330    2481    .   -   .   gene_id "Lalb_Chr00c40g0409321"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409321"; 
WOCE01000065.1  Genbank exon    2330    2481    .   -   .   gene_id "Lalb_Chr00c40g0409321"; transcript_id "Lalb_Chr00c40g0409321"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409321"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2659    2810    .   -   .   gene_id "Lalb_Chr00c40g0409331"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409331";

ADD REPLY • link updated 3.8 years ago by Istvan Albert 102k • written 3.8 years ago by 1234gingko ▴ 50

score 5 · Answer 1 · 2021-02-02

My guess is it's those lines with transcript_id "": they don't contain a valid ID, and so StringTie is complaining. It's always a bit of a worry to work out what to do with the transcript_id field on gene lines in a GTF file. The original GTF format didn't contain gene lines, but they appear to have crept in at some point. The Ensembl files just don't have a transcript_id field on their gene lines, but I bet that trips StringTie up as well.

As for what to do: I recommend just removing the gene lines. They are not necessary anyway. Something like:

awk '$3 != "gene" ' my_annotation.gtf > my_annotation_no_genes.gtf
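Before filtering, it may be worth confirming that the empty transcript_id values really are confined to gene lines. The following is only a minimal sketch: it assumes a tab-separated GTF with the attributes in column 9, and the function name count_empty_transcript_ids is made up for illustration.

```shell
# Hypothetical helper: count records whose transcript_id attribute is empty,
# grouped by feature type (column 3). Assumes tab-separated GTF columns.
count_empty_transcript_ids() {
    awk -F'\t' '
        /^#/ { next }                               # skip header/comment lines
        $9 ~ /transcript_id ""/ { n[$3]++ }         # empty ID found in attributes
        END { for (t in n) print t, n[t] }          # e.g. "gene 1234"
    ' "$1"
}

# Usage (file name taken from the awk example above):
#   count_empty_transcript_ids my_annotation.gtf
```

If anything other than "gene" is listed, removing the gene lines alone may not be enough to satisfy StringTie.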

score 0 · Answer 2 · 2022-04-01

0

2.7 years ago

bio • 0

Hi! I am having the same problem, and I don't know how to fix it.

ADD COMMENT • link 2.7 years ago by bio • 0

score 0 · Answer 3 · 2022-04-01

You can try AGAT

Input:

WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; 
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; 
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA"; exon_number "1"

Remove the transcript_id attribute from gene features:
agat_sp_manage_attributes.pl --gff test.gtf -p gene --att transcript_id -o test.gff

Output:

##gff-version 3
WOCE01000065.1  Genbank gene    90  241 .   -   .   ID=nbis-gene-1;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409271;locus_tag=Lalb_Chr00c40g0409271
WOCE01000065.1  Genbank RNA 90  241 .   -   .   ID=Lalb_Chr00c40g0409271;Parent=nbis-gene-1;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409271;locus_tag=Lalb_Chr00c40g0409271
WOCE01000065.1  Genbank exon    90  241 .   -   .   ID=exon-1;Parent=Lalb_Chr00c40g0409271;exon_number=1;gbkey=ncRNA;gene_id=Lalb_Chr00c40g0409271;locus_tag=Lalb_Chr00c40g0409271;product=hypothetical ncRNA;transcript_id=Lalb_Chr00c40g0409271
WOCE01000065.1  Genbank gene    417 575 .   -   .   ID=nbis-gene-2;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409281;locus_tag=Lalb_Chr00c40g0409281
WOCE01000065.1  Genbank RNA 417 575 .   -   .   ID=Lalb_Chr00c40g0409281;Parent=nbis-gene-2;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409281;locus_tag=Lalb_Chr00c40g0409281
WOCE01000065.1  Genbank exon    417 575 .   -   .   ID=exon-2;Parent=Lalb_Chr00c40g0409281;exon_number=1;gbkey=ncRNA;gene_id=Lalb_Chr00c40g0409281;locus_tag=Lalb_Chr00c40g0409281;product=hypothetical ncRNA;transcript_id=Lalb_Chr00c40g0409281

Convert back into GTF:
agat_convert_sp_gff2gtf.pl --gff test.gff -o test_clean.gtf

Output:

##gtf-version 3
WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; ID "nbis-gene-1"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271";
WOCE01000065.1  Genbank transcript  90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; ID "Lalb_Chr00c40g0409271"; Parent "nbis-gene-1"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; original_biotype "rna";
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; ID "exon-1"; Parent "Lalb_Chr00c40g0409271"; exon_number "1"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA";
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; ID "nbis-gene-2"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281";
WOCE01000065.1  Genbank transcript  417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; ID "Lalb_Chr00c40g0409281"; Parent "nbis-gene-2"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; original_biotype "rna";
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; ID "exon-2"; Parent "Lalb_Chr00c40g0409281"; exon_number "1"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA";
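Whichever fix is used, a quick sanity check on the cleaned file before re-running StringTie could look like the sketch below. It rests on assumptions: check_transcript_ids is a hypothetical helper, columns are tab-separated, and gene lines are treated as the only records allowed to lack a transcript_id.

```shell
# Hypothetical helper: succeed (exit 0) only if every non-comment, non-gene
# record carries a non-empty transcript_id; offending lines go to stderr.
check_transcript_ids() {
    awk -F'\t' '
        /^#/ || $3 == "gene" { next }               # comments and gene lines are exempt
        $9 !~ /transcript_id "[^"]+"/ {             # missing or empty transcript_id
            bad++
            print "line " NR ": " $0 > "/dev/stderr"
        }
        END { exit (bad > 0) }                      # non-zero status if any were found
    ' "$1"
}

# Usage:
#   check_transcript_ids test_clean.gtf && echo "transcript_id looks OK"
```

A clean exit here does not guarantee that StringTie accepts the file, only that this particular cause of "no valid ID found" has been ruled out.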