I want to run some data through the isoseq pipeline using the T2T genome.
I downloaded the genome files from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/. Specifically, I downloaded:
- GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf
- GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
I tried running pigeon prepare with the command:
pigeon prepare GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
It gave me the error:
| 20240918 15:52:07.087 | FATAL | pigeon prepare ERROR: GFF/GTF file error, improperly formatted record
reason : empty record ID
record : NC_060925.1 BestRefSeq gene 7506 138480 . - . gene_id "LOC127239154"; transcript_id ""; db_xref "GeneID:127239154"; description "uncharacterized LOC127239154"; gbkey "Gene"; gene "LOC127239154"; gene_biotype "lncRNA"; partial "true";
See format documentation at https://isoseq.how
I thought the issue might be that the gene name attribute is called gene rather than gene_name, so I tried renaming it with awk:
awk '{gsub(/"; gene "/,"; gene_name "); print}' GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf > GCF_009914755.1_T2T-CHM13v2.0_genomic_edited.gtf
I ran pigeon prepare on the edited file and got essentially the same error (the only difference was that gene had become gene_name in the quoted record). I don't know what else is wrong with the format, or what "record ID" refers to. Has anyone seen this before?
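
One detail visible in the failing record is the empty attribute transcript_id "". Whether that is what pigeon means by "empty record ID" is not confirmed here; it is only a guess based on the quoted line. Under that assumption, the following minimal Python sketch counts how many records in a GTF carry an empty transcript_id, grouped by feature type, which would show whether the problem is limited to gene lines. The function name count_empty_transcript_ids, the abbreviated sample records, and the ID "XR_EXAMPLE.1" are hypothetical and exist only for illustration; the sample assumes tab-separated columns, as in a standard GTF.

```python
import re
from collections import Counter

# Matches an empty transcript_id attribute, e.g.  transcript_id "";
# The lookbehind avoids matching longer keys such as orig_transcript_id.
EMPTY_TRANSCRIPT_ID = re.compile(r'(?<!\w)transcript_id\s+""')


def count_empty_transcript_ids(lines):
    """Count records whose transcript_id is empty, keyed by feature type (column 3)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a 9-column GTF record
        if EMPTY_TRANSCRIPT_ID.search(fields[8]):
            counts[fields[2]] += 1
    return counts


# Hypothetical, abbreviated records modelled on the line quoted in the error.
sample = [
    "# comment line\n",
    "\t".join(["NC_060925.1", "BestRefSeq", "gene", "7506", "138480", ".", "-", ".",
               'gene_id "LOC127239154"; transcript_id ""; gene "LOC127239154";']) + "\n",
    "\t".join(["NC_060925.1", "BestRefSeq", "transcript", "7506", "138480", ".", "-", ".",
               'gene_id "LOC127239154"; transcript_id "XR_EXAMPLE.1";']) + "\n",
]
print(count_empty_transcript_ids(sample))

# To scan the real file instead of the sample:
# with open("GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf") as handle:
#     print(count_empty_transcript_ids(handle))
```

If the counts turn out to be confined to gene features, that would at least narrow down which records the error message is complaining about.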