6.3 years ago
oseias.rf.junior
Hi to all,
I'm trying to use a script to find IS in my genomes, but the script only runs on single-FASTA files (i.e., complete genomes). Some of my genome files are multi-FASTA files. Is there a Python (or Perl) script that removes all the contig headers in a file (e.g. ">contig-header") except the first one, replacing each removed header with something like "NNNNNNNNNN"? That way I could still map where the headers were, and I could run the IS script mentioned above on the resulting multifasta2singlefasta files.
Changing data drastically so it would work with a script seems like a really bad idea to me.
Why not split the multi-fasta files into individual ones and run the tool on those files instead of doing what you are proposing?
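For anyone who wants to try the splitting route, here is a minimal sketch in plain Python (standard library only). The function name `split_multifasta` and the output naming scheme are made up for illustration. It assumes the first whitespace-delimited token of each header is the contig ID, and it prefixes each file name with a running number so the original contig order is not lost.

```python
import os
import re


def split_multifasta(in_path, out_dir):
    """Write each record of a multi-FASTA file to its own file in out_dir.

    Returns the list of files written, in input order. The input file is
    only read, never modified.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(in_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Assumption: the first token of the header is the contig ID.
                tokens = line[1:].split()
                contig_id = tokens[0] if tokens else "unnamed"
                # Replace characters that are awkward in file names.
                safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", contig_id)
                # The numeric prefix keeps input order and avoids name clashes.
                name = f"{len(written) + 1:04d}_{safe_id}.fasta"
                path = os.path.join(out_dir, name)
                out = open(path, "w")
                written.append(path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Each output file keeps its original header line, so the contig name travels with the sequence.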
Because some genomes have something like 300 contigs. Splitting them all doesn't seem practical to me. I would probably lose track of the contigs that belong to a single file. The IS script gives me the coordinates, so I can track the IS locations later.
Then you do not need the contig headers anyway. The coordinates will still be correct. Your proposed approach would actually be more difficult, since you’d be replacing the headers with an arbitrary number of Ns (and possibly different numbers of Ns depending on how you did the substitution), and so your base indices would be completely meaningless.
It didn't need to be an arbitrary number of Ns.
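To make the point about a fixed spacer concrete, here is a rough sketch of how a coordinate in the concatenated sequence could be mapped back to a contig. Everything here is hypothetical: the names `contig_offsets` and `to_contig_coord`, the spacer length of 10, and the choice of 0-based coordinates are assumptions, so adjust them to whatever the IS script actually reports.

```python
from bisect import bisect_right

SPACER_LEN = 10  # assumed fixed number of Ns placed between contigs


def contig_offsets(lengths, spacer_len=SPACER_LEN):
    """Return the 0-based start of each contig in the concatenated sequence."""
    starts = []
    pos = 0
    for length in lengths:
        starts.append(pos)
        pos += length + spacer_len
    return starts


def to_contig_coord(pos, names, lengths, spacer_len=SPACER_LEN):
    """Map a 0-based concatenated position to (contig name, local position).

    Returns None when the position falls inside a spacer or outside the
    concatenated sequence.
    """
    starts = contig_offsets(lengths, spacer_len)
    # Index of the last contig that starts at or before pos.
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    local = pos - starts[i]
    if local >= lengths[i]:
        return None  # pos lies in the spacer after contig i, or past the end
    return names[i], local
```

With contigs of length 100 and 50, the second contig starts at position 110, so `to_contig_coord(110, ["c1", "c2"], [100, 50])` gives `("c2", 0)`, while position 105 lands in the spacer and gives `None`. This only works if the spacer length is the same everywhere and the contig lengths are recorded when the file is built.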
But anyway, thank you very much, genomax, jrj.healey and Ram, for giving me some clues and advice so quickly. I'll follow what you suggested.
Please search the forum for these various tasks; each one has very well-documented solutions.
e.g.:
Concatenating sequences
Methods for manipulating fasta headers (one of many)
Give them a try. If you can’t make a solution work, come back and show us what you’ve tried and where you are stuck.
BTW,
An example of what I want the Python script to do:
I’m with Ram and the others, this seems like a bad approach. Just concatenate the file normally, as a separate file and keep the original.
Concatenating draft genomes (or more usually scaffolding them) is fairly normal.
To add to jrj.healey's point, always design tools to write to stdout or an explicitly specified output file. Modifying an input file is unexpected (verging on harmful) behavior.
It's not my script (the one that finds IS in complete genomes). Someone else designed it the way I described. I'm just trying to figure out a way to take a first look at my data (the genomes that are not complete).
OK, then I'd recommend creating a new single fasta file and working on that.
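As a starting point, a minimal sketch of that approach in plain Python might look like the following. The function names, the default spacer of 10 Ns, and the 60-column line width are all assumptions for illustration, not anything the IS script requires. The original file is only read, and the function refuses to run if the output path is the same as the input path.

```python
import os


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def concatenate_fasta(in_path, out_path, spacer_len=10, width=60):
    """Write all contigs as one record to out_path, joined by a fixed N spacer.

    The first contig's header is kept. Returns [(header, length), ...] in
    input order so positions can be traced back to contigs later.
    """
    if os.path.abspath(in_path) == os.path.abspath(out_path):
        raise ValueError("refusing to overwrite the input file")
    records = list(read_fasta(in_path))
    if not records:
        raise ValueError("no FASTA records found in " + in_path)
    spacer = "N" * spacer_len
    merged = spacer.join(seq for _, seq in records)
    with open(out_path, "w") as out:
        out.write(">" + records[0][0] + "\n")
        # Wrap the sequence at a fixed line width.
        for i in range(0, len(merged), width):
            out.write(merged[i:i + width] + "\n")
    return [(header, len(seq)) for header, seq in records]
```

It is worth saving the returned list of headers and lengths next to the new file, since together with the spacer length it is what lets you work out which contig a reported coordinate came from.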
What is IS?
probably insertions