Question

Changing gene headers in .sqn file for genome upload NCBI

4.4 years ago

T_18 ▴ 50

Dear all,

My question might be quite specific, but hopefully someone has some advice for me. I am trying to upload a .sqn file to the NCBI genome database, and I received an error saying that my sequence/gene headers are too long. The headers are automatically generated by the annotation software and ideally I do not want to change these in my original GFF and genome assembly files. But how do I change the headers in the .sqn only? I could simply create a list of the long names and rename them to "gene1", "gene2", "gene3", etc.

I have never worked with a .sqn file, so I would like to ask whether the above is possible.

One of the error lines looks like this:

ERROR: File: sexigua_aseembly_annot.sqn, Code(SEQ_INST_BadSeqIdFormat), Sequence-id: gnl|HF086|augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1:cds, General identifier longer than 50 characters

Here, "augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1" should be replaced by something shorter.

Thanks in advance for your ideas and suggestions!

NCBI sqn UNIX • 1.5k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 4.4 years ago by T_18 ▴ 50

> The headers are automatically generated by the annotation software and ideally I do not want to change these in my original GFF and genome assembly files. But how do I change the headers in the .sqn only?

If you do that, would it not throw your entire dataset out of sync?

> pilon_pilon-processed-gene-0.1

This part sounds pretty uninformative and unimportant. Can it not be replaced everywhere? What does 0.1 refer to?

ADD REPLY • link 4.4 years ago by GenoMax 147k

On your first question, I was afraid of that too. But this .sqn file is the only file I upload, so it contains the genome assembly (which does not get changed) and the information from the GFF file. Because I want to edit every occurrence of the gene headers within this file, I think it should work out fine (see comment below). It is at least worth a try.

ADD REPLY • link 4.4 years ago by T_18 ▴ 50

Just to add, I have created a text file with the old gene name in the first column and the new gene name in the second column:

> augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1-mRNA-1    gene1
> maker-tulip_contig_306_pilon_pilon-augustus-gene-0.12-mRNA-1  gene2
> maker-tulip_contig_306_pilon_pilon-augustus-gene-0.13-mRNA-1  gene3

Ideally, with a sed-like command, each occurrence in the .sqn file of a name from the first column gets replaced by the name in the second column. I am not sure whether that is how a .sqn file works, but it is worth a try. If anyone knows how to use sed in combination with a separate file to replace thousands of strings, then please let me know! I will post it here if I find a solution too.
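
One way to do the replacement described above is a short Python script instead of sed. This is only a sketch under some assumptions: the mapping file is plain text with the old name and the new name separated by whitespace, the .sqn file is text (not binary) ASN.1, and each old name appears unbroken on a single line. The function names (`load_mapping`, `rename_ids`, `rename_file`) are made up for this illustration. Names are escaped so that the dots are matched literally, and a name is only replaced when it is not followed by another name character, so `...-mRNA-1` is left alone inside `...-mRNA-10`.

```python
import re


def load_mapping(lines):
    """Parse 'old_name <whitespace> new_name' lines into a dict."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        old_name, new_name = fields
        mapping[old_name] = new_name
    return mapping


def rename_ids(lines, mapping):
    """Yield each line with every mapped old name replaced by its new name."""
    if not mapping:
        yield from lines
        return
    # Longest names first, so a longer name wins over a shorter name
    # that happens to be its prefix.
    names = sorted(mapping, key=len, reverse=True)
    # re.escape makes '.' and '-' literal. The lookahead is an assumption
    # about what can follow an ID: it refuses a match that is immediately
    # followed by a letter, digit, '_', '.' or '-'.
    pattern = re.compile(
        "(?:" + "|".join(re.escape(name) for name in names) + r")(?![\w.-])"
    )
    for line in lines:
        yield pattern.sub(lambda match: mapping[match.group(0)], line)


def rename_file(mapping_path, in_path, out_path):
    """Write a renamed copy of in_path to out_path; the input is not modified."""
    with open(mapping_path) as handle:
        mapping = load_mapping(handle)
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(rename_ids(src, mapping))
```

Calling `rename_file("names.txt", "in.sqn", "out.sqn")` (hypothetical file names) writes a renamed copy and leaves the input untouched, so the result can be diffed against the original before uploading.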

ADD REPLY • link updated 4.4 years ago by GenoMax 147k • written 4.4 years ago by T_18 ▴ 50

Even if the .sqn file gets through NCBI's initial check, the IDs would then not match the GTF/sequence files. Are you replacing these IDs in all files?

ADD REPLY • link 4.4 years ago by GenoMax 147k

What I understood from the NCBI tutorials (though it is quite hard to follow everything!) is that this .sqn file is the only file I need to upload and that it is derived from the assembly and the GFF file. Am I right here? If I still need to upload the GFF file as well, then indeed this will not work.

ADD REPLY • link 4.4 years ago by T_18 ▴ 50

I now think I will do the gene name replacement on the original GFF file instead. That way I can be sure that the resulting .sqn file is built correctly and that I have not accidentally changed anything in the .sqn that shouldn't be changed!
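
Renaming in the GFF file can be restricted to the attributes column, so that coordinates and contig names cannot be touched by accident. The following is a rough Python sketch, not a tested pipeline. It assumes a GFF3 file with nine tab-separated columns, a `mapping` dict of old name to new name (for example built from the two-column text file mentioned earlier), and that derived IDs carry a suffix after a colon, as in the `...-mRNA-1:cds` ID from the error message. The function names are hypothetical.

```python
def rename_value(value, mapping):
    """Rename the part of an ID before the first ':' and keep any suffix."""
    head, sep, tail = value.partition(":")
    return mapping.get(head, head) + sep + tail


def rename_gff_line(line, mapping, keys=("ID", "Parent", "Name")):
    """Return a GFF3 line with mapped names replaced in selected attributes."""
    if line.startswith("#") or not line.strip():
        return line  # comments, directives and blank lines pass through
    newline = "\n" if line.endswith("\n") else ""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return line  # not a feature line (e.g. embedded FASTA), leave as-is
    attributes = []
    for attribute in columns[8].split(";"):
        key, sep, value = attribute.partition("=")
        if sep and key in keys:
            # Parent may hold several comma-separated IDs.
            value = ",".join(rename_value(v, mapping) for v in value.split(","))
        attributes.append(key + sep + value)
    columns[8] = ";".join(attributes)
    return "\t".join(columns) + newline


def rename_gff(in_path, out_path, mapping):
    """Write a renamed copy of the GFF file; the original stays untouched."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(rename_gff_line(line, mapping))
```

Because only exact matches on whole ID values (before any colon suffix) are renamed, a name that is missing from the mapping is simply left as it was, which makes leftovers easy to spot by searching the output for the old prefixes.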