Hi, I successfully aligned and analyzed my RNA-Seq data using HISAT2, StringTie, and DESeq2, with the La_Amiga3_1 genome (white lupin) from NCBI to map transcripts. Beginner's luck. Now I am trying to do the exact same thing using the CNRS_Lalb genome (also white lupin, on NCBI), and when I get to the first StringTie step, I get "Error: no valid ID found for GFF record". I have looked at both genome GTF files, and the first field (chromosome ID) looks fine in each (cut -f 1 *.gtf | sort | uniq). The two genomes use different chromosome names, but I don't think that is the problem. I am looking for more hints as to what this error means. I did read the StringTie manual but need more help. Thanks very much,
K
My guess is it's those lines with transcript_id=="": they don't contain a valid ID, and so StringTie is complaining. It's always a bit of a worry to work out what to do with the transcript_id field on gene lines in a GTF file. The original GTF format didn't contain gene lines, but they appear to have crept in at some point. The ENSEMBL files just don't have a transcript_id field on their gene lines, but I bet that trips StringTie up as well.
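One way to check this guess is to count, per feature type, the records whose attribute column has no non-empty transcript_id. This is only a sketch: the function name is made up for illustration, and it assumes a standard tab-separated GTF with the feature type in column 3 and the attributes in column 9.

```shell
# Hypothetical helper (not part of StringTie): count records, grouped by
# feature type (column 3), whose attributes (column 9) have no non-empty
# transcript_id. Comment lines starting with '#' are skipped.
count_records_without_transcript_id() {
    awk -F'\t' '
        !/^#/ && $9 !~ /transcript_id "[^"]+"/ { n[$3]++ }
        END { for (f in n) print f "\t" n[f] }
    ' "$1"
}
```

Running `count_records_without_transcript_id genome.gtf` (file name is a placeholder) prints one line per feature type. If only "gene" shows up, the gene lines are the likely culprit.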
As for what to do:
I recommend just removing the gene lines. They are not necessary anyway. Something like:
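A minimal sketch of that filtering, assuming a standard tab-separated GTF where column 3 holds the feature type. The function name and the file names in the usage line are placeholders, not anything from StringTie or NCBI:

```shell
# Hypothetical helper: print every line of a GTF except the records whose
# feature type (column 3) is exactly "gene". Header/comment lines are kept,
# since their third tab-separated field is empty.
strip_gene_lines() {
    awk -F'\t' '$3 != "gene"' "$1"
}
```

Usage would look like `strip_gene_lines genome.gtf > genome.no_genes.gtf`, and then the filtered file is passed to StringTie in place of the original annotation.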
Omg, thanks so much. This enabled me to find a prior post, "Ensembl GTF format: isn't the tag "transcript_id" mandatory?",
in which Ensembl explains the evolution of the GTF format and suggests exactly what you suggest:
"I would recommend removing the gene lines from the gtf file". This gets me back on track so fast, I appreciate it! - Karen
Can you please post a couple of lines of the GTF file?
sure, thanks: