Changing gene headers in .sqn file for genome upload NCBI
4.4 years ago
T_18 ▴ 50

Dear all,

My question might be quite specific, but hopefully someone has some advice for me. I am trying to upload a .sqn file to the NCBI genome database. I received an error saying that my sequence/gene headers are too long. The headers are automatically generated by the annotation software and ideally I do not want to change these in my original GFF and genome assembly files. But how do I change the headers in the .sqn only? I could simply create a list of the long names and rename them to "gene1, gene2, gene3", etc.

I have never worked with a .sqn file, though, so I would like to ask whether the above is possible.

One of the error lines looks like this:

ERROR: File: sexigua_aseembly_annot.sqn, Code(SEQ_INST_BadSeqIdFormat), Sequence-id: gnl|HF086|augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1:cds, General identifier longer than 50 characters

Where "augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1" should be replaced by something shorter.
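To see how many names are affected before renaming anything, a length check along these lines might help. This is only a minimal sketch: the 50-character limit is taken from the error message above, and it assumes the limit applies to the identifier string itself. `too_long` and `MAX_LEN` are hypothetical names used for illustration.

```python
MAX_LEN = 50  # limit quoted in the error message; assumed to apply to the ID itself


def too_long(ids, max_len=MAX_LEN):
    """Return the identifiers whose length exceeds max_len."""
    return [name for name in ids if len(name) > max_len]


ids = [
    "augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1",
    "gene1",
]

# Print each offending identifier together with its length.
for name in too_long(ids):
    print(len(name), name)
```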

Thanks in advance for your ideas and suggestions!

NCBI sqn UNIX • 1.5k views

The headers are automatically generated by the annotation software and ideally I do not want to change these in my original GFF and genome assembly files. But how do I change the headers in the .sqn only?

If you do that, would it not throw your entire dataset out of sync?

pilon_pilon-processed-gene-0.1

This part sounds neither informative nor important. Can it not be replaced everywhere? What does 0.1 refer to?


On your first question, I was afraid of that too. But this .sqn file is the only file I upload, so it contains both the genome assembly (which does not get changed) and the information from the GFF file. Because I want to change every occurrence of the gene headers within this file, I think it should work out fine (see my comment below). It is at least worth a try.


Just to add, I have created a text file with the old gene name in the first column and the new gene name in the second column:

> augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1    gene1
> maker-tulip_contig_306_pilon_pilon-augustus-gene-0.12-mRNA-1  gene2
> maker-tulip_contig_306_pilon_pilon-augustus-gene-0.13-mRNA-1  gene3

Ideally, with a sed-like command, each occurrence in the .sqn file of a name from the first column gets replaced by the name in the second column. I am not sure if that is how a .sqn file works, but it is worth a try. If anyone knows how to use sed in combination with a separate file to replace thousands of strings, then please let me know! I will post it here if I find a solution too.
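As an alternative to sed, the two-column replacement described here could be sketched in Python. This is a rough sketch, not a tested pipeline: it assumes the .sqn file is plain-text ASN.1 and that the mapping file holds exactly two whitespace-separated columns per line. `load_mapping` and `rename_all` are hypothetical helper names.

```python
import re


def load_mapping(lines):
    """Parse 'old_name<whitespace>new_name' lines into a dict."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            mapping[parts[0]] = parts[1]
    return mapping


def rename_all(text, mapping):
    """Replace every occurrence of each old name with its new name in one pass."""
    if not mapping:
        return text
    # Longest names first, so a name that is a prefix of another
    # (e.g. ...-mRNA-1 vs ...-mRNA-10) cannot match too early.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Small in-memory demonstration with names from this thread.
mapping = load_mapping([
    "augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1\tgene1",
    "maker-tulip_contig_306_pilon_pilon-augustus-gene-0.12-mRNA-1\tgene2",
])
record = "gnl|HF086|augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1:cds"
print(rename_all(record, mapping))
```

For real files, the mapping lines and the text would be read with `open(...)` and the result written to a new file rather than over the original, so the change can be checked with a diff first.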


Even if the .sqn file gets through NCBI's initial check, the IDs would then not match the GTF/sequence files. Are you replacing these IDs in all files?


What I understood from the NCBI tutorials (although it is quite hard to follow everything!) is that this .sqn file is the only file I need to upload and that it is derived from the assembly and the GFF file. Am I right here? If I also need to upload the GFF file, then indeed this will not work.


I now think I will do the gene name replacement on the original GFF file instead. That way I can be sure that the resulting .sqn is built correctly and that I have not accidentally changed anything in the .sqn that shouldn't be changed!
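For the GFF route, one cautious option is to touch only the attributes column, so that contig names in column 1 stay as they are. The sketch below assumes GFF3-style `key=value` attributes in column 9, assumes that child features may carry a suffix after a colon (as in the `:cds` seen in the error message), and assumes that every name to be renamed has its own entry in the mapping. `rename_gff_line` is a hypothetical helper.

```python
def rename_gff_line(line, mapping):
    """Rename ID/Parent/Name values in column 9 of one GFF3 line."""
    if line.startswith("#"):
        return line  # comments and directives pass through unchanged
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line  # not a feature line, leave it alone
    attrs = []
    for attr in cols[8].split(";"):
        key, sep, value = attr.partition("=")
        if sep and key in ("ID", "Parent", "Name"):
            renamed = []
            for v in value.split(","):  # Parent may list several IDs
                # Keep any ':suffix' (e.g. ':cds') and rename only the base name.
                base, colon, suffix = v.partition(":")
                renamed.append(mapping.get(base, base) + colon + suffix)
            value = ",".join(renamed)
        attrs.append(key + sep + value)
    cols[8] = ";".join(attrs)
    return "\t".join(cols) + ("\n" if line.endswith("\n") else "")


# Illustrative feature line; the coordinates are made up for the demonstration.
mapping = {"maker-tulip_contig_306_pilon_pilon-augustus-gene-0.12-mRNA-1": "gene2"}
line = (
    "tulip_contig_306_pilon_pilon\tmaker\tCDS\t1\t9\t.\t+\t0\t"
    "ID=maker-tulip_contig_306_pilon_pilon-augustus-gene-0.12-mRNA-1:cds;"
    "Parent=maker-tulip_contig_306_pilon_pilon-augustus-gene-0.12-mRNA-1\n"
)
print(rename_gff_line(line, mapping), end="")
```

Names that are not in the mapping are left unchanged, so the output can be compared with the original GFF to confirm that only the intended IDs moved.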


That sounds like a good plan.
