7.2 years ago
Mehmet
Dear all,
I have an assembly with ~1000 scaffolds, along with a gff3 file and a gtf file. How can I generate an embl file for each scaffold of the assembly?
Thank you. I tried it, and I would like to ask how to use the scaffold names as IDs in the output file.
I have several questions about your script.
How can I provide a protein file to use the --translate option? What argument does this option accept?
In the output file, how can I assign unique IDs based on the scaffold name?
Here is an example of the output file. I want scaffold00001 as the ID, not XXX. How can I do that?
Hi, 1 - I don't understand what you want to do. Or maybe our explanations are not clear enough, so let's clarify. You should provide the fasta of the assembly, that is, the DNA sequences. There is no way to pass a protein file to the tool. The --translate option adds the translation of the CDS features contained in your GFF. It is a boolean flag and takes no argument.
2 - Actually, the tool currently uses the accession as the prefix for the ID. I am modifying that to use a locus_tag option to define it, which will be clearer. But I don't plan to implement the possibility of using the scaffold name as part of the locus tag. Only the use of the locus_tag given by ENA is mandatory. The rest is an arbitrary choice we made (locus1, locus2, locus3 ...).
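To illustrate point 1, here is a minimal Python sketch of what "adding the translation of a CDS" means: the protein is derived from the DNA, so no protein file is needed. This is not the tool's own code. The function name `translate_cds` is hypothetical, and the sketch assumes the standard genetic code (transl_table 1), a CDS on the forward strand, and a sequence starting in frame.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(dna):
    """Translate an in-frame, forward-strand CDS with the standard code.

    Hypothetical helper for illustration only. Codons containing
    ambiguous bases become 'X'; a trailing stop codon is dropped.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    # Walk the sequence codon by codon, ignoring a trailing partial codon.
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein).rstrip("*")


print(translate_cds("ATGGCCAAGTAA"))
```

A different transl_table value selects a different codon-to-amino-acid mapping, which is why the correct table for the organism matters when translations are written to the output.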
Hi,
For the second question, for instance, in my fasta file:
in my gff file: scaffold01 scaffold01
What I would like to ask is: how can the IDs in the embl file follow the order of the fasta file? For example, ID scaffold01, ID scaffold02, etc.
Currently it's not possible, but I will see if I can implement something like that.
I was able to. Sorry for that.
Excellent, could you share your trick?
Yes, of course. The script is easy to use, and it saved my life. I had searched through many scripts and tools for converting to embl format. Thank you for providing this script. It took ~5 minutes to generate the embl file for a 75 Mbp assembly with ~17000 gene models, and it ran without any errors.
The documentation could give more information for beginners. For example:
an explanation of "transl_table".
The most important thing is the scaffold IDs that I mentioned before. If you can provide this, it would be much easier to split a big embl file into small files that can be used for downstream analyses.
Once again, thank you for writing this script.
Thank you very much for your feedback. I will try to improve the help. Actually, I mixed up the terms: when you said "ID", I thought you meant "locus_tag". I have just realised you were talking about the ID line. My fault. So, indeed, there is no way to put the contig name in the ID line; doing so would break the EMBL rules and the result would not be a valid EMBL flat file. What I can easily do, however, is add the contig name to the DE line. It used to work that way, and it is compatible with the format. That could help when splitting a big embl file into small files.
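As a rough sketch of that splitting step, the Python below cuts a multi-record EMBL flat file at the `//` record terminators and names each output file after a scaffold name found on a DE line. The function names, the default `scaffold\d+` pattern, and the `.embl` extension are assumptions made for illustration. The exact text the tool writes to the DE line is not specified here, so the pattern may need adjusting.

```python
import re
from pathlib import Path


def split_embl_records(text):
    """Split a multi-record EMBL flat file into one string per record."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":  # "//" terminates each EMBL record
            records.append("\n".join(current) + "\n")
            current = []
    if any(l.strip() for l in current):  # keep an unterminated last record
        records.append("\n".join(current) + "\n")
    return records


def name_from_de(record, pattern=r"scaffold\d+"):
    """Return the first match of `pattern` on a DE line, or None.

    The default pattern is an assumption based on names such as
    scaffold01 used in this thread.
    """
    for line in record.splitlines():
        if line.startswith("DE"):
            match = re.search(pattern, line)
            if match:
                return match.group(0)
    return None


def write_split(embl_path, out_dir):
    """Write each record to its own file and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    records = split_embl_records(Path(embl_path).read_text())
    for i, record in enumerate(records, start=1):
        # Fall back to a numbered name when no scaffold name is found.
        name = name_from_de(record) or f"record{i}"
        path = out / f"{name}.embl"
        path.write_text(record)
        written.append(path)
    return written
```

Calling `write_split("assembly.embl", "per_scaffold")` (both paths are placeholders) would then produce one file per scaffold, provided each DE line carries a distinct name.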
Could you format it properly? It's hard to read like that.