Dear all,
My question might be quite specific, but hopefully someone has some advice for me. I am trying to upload a .sqn file to the NCBI genome database. I received an error saying that my sequence/gene headers are too long. The headers are generated automatically by the annotation software, and ideally I do not want to change them in my original GFF and genome assembly files. So how do I change the headers in the .sqn only? I could simply create a list of the long names and rename them to "gene1, gene2, gene3" etc.
But I have never worked with a .sqn file, so I would like to ask whether the above is possible.
One of the error lines looks like this:
ERROR: File: sexigua_aseembly_annot.sqn, Code(SEQ_INST_BadSeqIdFormat), Sequence-id: gnl|HF086|augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1:cds, General identifier longer than 50 characters
Where "augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1" should be replaced by something shorter.
Thanks in advance for your ideas and suggestions!
If you do that, would it not throw your entire dataset out of sync?
This sounds pretty non-informative and unimportant. Can that not be replaced everywhere? What does 0.1 refer to?
On your first question, I was afraid of that too. But this .sqn file is the only file I upload, so it contains the genome assembly (which does not get changed) and the information from the GFF file. Because I want to edit every occurrence of the gene headers within this file, I think it should work out fine (see comment below). It is at least worth a try.
Just to add, I have created a text file with the old gene name in the first column and the new gene name in the second column.
Ideally, with a sed-like command, each occurrence in the .sqn file of a name from the first column gets replaced by the name in the second column. I am not sure if that's how a .sqn file works, but it is worth a try. If anyone knows how to use sed in combination with a separate file to replace thousands of strings, then please let me know! I will post it here if I find a solution too.
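For what it's worth, here is a minimal sketch of that two-column replacement, written in Python rather than sed. It assumes the file to edit is plain text (a .sqn is normally text ASN.1, but that is worth checking first) and that the mapping file has one whitespace-separated "old new" pair per line. The function names load_mapping and rename_ids, and the file names in the comments, are made up for illustration.

```python
import re


def load_mapping(lines):
    """Parse whitespace-separated 'old new' pairs, skipping incomplete lines."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def rename_ids(text, mapping):
    """Replace every occurrence of each old name with its new name in one pass."""
    if not mapping:
        return text
    # Try the longest names first, so that '...-mRNA-1' is never matched
    # as a prefix of '...-mRNA-10'.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Hypothetical usage (all file names are placeholders):
# with open("rename_map.txt") as fh:
#     mapping = load_mapping(fh)
# with open("input.txt") as fh:
#     text = fh.read()
# with open("output.txt", "w") as fh:
#     fh.write(rename_ids(text, mapping))
```

Doing all replacements in a single pass avoids one rename accidentally rewriting the result of another. With many thousands of names the combined pattern may get slow, so treat this as a starting point and compare the output against the input before uploading anything.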
Even if the .sqn file gets through NCBI's initial check, the IDs would then not match the GTF/sequence files. Are you replacing these IDs in all files?
But what I understood from the NCBI tutorials (it is quite hard to follow everything!) is that this .sqn file is the only file I need to upload and that it is derived from the assembly and the GFF file. Am I right here? If I still need to upload the GFF file as well, then indeed this will not work.
I now actually think I will do the gene name replacement on the original GFF file. That way I can be sure that the resulting .sqn is built correctly and that I have not accidentally changed anything in the .sqn that shouldn't be changed!
That sounds like a good plan.