I'm having some problems trying to change the headers of hundreds of fasta files. Each fasta file holds the sequence of one gene for several species, but for some species the header is different for each gene, for example:
>EOG7B0H7N|Ectop_sp|C227106_a_3_0_l_281|.|.|Zoractes_sp
>EOG7B0H7P|Ectop_sp|s6255_L_9215_0_a_59_2_l_2007|.|.|Zoractes_sp
>EOG7B0H87|Ectop_sp|C242868_a_14_0_l_390|.|.|Zoractes_sp
>EOG7B3CGS|Ectop_sp|C272142_a_50_0_l_1449|.|.|Zootermopsis_sp
>EOG7B67Q7|Ectop_sp|C265168_a_16_0_l_886|.|.|Zoractes_sp
The structure of the fasta file for the first gene is something like:
>sp1
>sp2
>sp3
>EOG7B67Q7|Ectop_sp|C265168_a_16_0_l_886|.|.|Zoractes_sp
>sp4
I want to rename the header for this species so that it contains only the species name, for example Ectop_sp:
>sp1
>sp2
>sp3
>Ectop_sp
>sp4
Thanks for the help.
To summarize, for all description lines in your FASTA files that contain Ectop_sp somewhere, you'd like the entire description line reduced to just >Ectop_sp?
Are your FASTA files in a directory all alone? What extension do they have? Do the names all end the same way, so you can easily iterate over them to do the replacement without touching any files you don't want to change?
1- Yes, the .fasta files are in a directory called "header". 2- The header structure is the same, but across the fasta files I have different species with the same header structure as Ectop_sp.
I use a .py script that I found somewhere; it works for one of my files, but I don't know how to use it with all the 2160 genes I have.
The reason I asked for clarification about the first part is that I think you can do it with a regular expression, which is essentially a fancy find-and-replace: look for any line that begins with > and has Ectop_sp somewhere in it, with any number of other characters before and after, then replace the whole line. I've been meaning to try sd, which is supposed to be easier to use and faster than sed for this sort of thing, and faster than Python. However, for things like this speed usually isn't critical, as under a minute vs. 20 minutes isn't that big of a difference if you are only doing it once. So it looks like you solved that part. (By the way, I was going to point you to a temporary MyBinder session where you'd be able to run sd, since I didn't want to make you install it on your machine. You would have needed to archive/compress your directory, upload it to the session, and then download your results afterwards.)
My follow-up question was meant to address looping over the files and applying the main processing step to each one. For that I was going to suggest Python's glob or fnmatch modules to find every fasta file in the directory and run the find-and-replace on each. I'm just more used to doing that in Python; however, it looks like caleb solved that too with a nice bash loop.
So if you get stuck, let me know if you'd like my version that would work on a remote temporary session.
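For reference, here is a minimal Python sketch of the two steps discussed in this thread: the regular-expression replacement on description lines and the loop over every fasta file in a directory. It is not the script the original poster found, nor caleb's bash loop. The function names simplify_header and rewrite_fasta_dir and the ".renamed" output suffix are made up for illustration. It also assumes the species name should match one whole |-delimited field, which is slightly stricter than "contains Ectop_sp somewhere", and it writes new files next to the originals instead of editing them in place.

```python
import re
from pathlib import Path


def simplify_header(line, name):
    """Reduce a FASTA description line to '>name' if it contains `name`.

    Assumption: `name` must be a complete '|'-delimited field, so that
    'Ectop_sp' does not accidentally match a field such as 'Ectop_sp2'.
    Any other line (other headers, sequence lines) is returned unchanged.
    """
    stripped = line.rstrip("\r\n")
    ending = line[len(stripped):]  # keep the original line ending, if any
    pattern = r"^>(?:.*\|)?" + re.escape(name) + r"(?:\|.*)?$"
    if re.match(pattern, stripped):
        return ">" + name + ending
    return line


def rewrite_fasta_dir(directory, name, pattern="*.fasta", out_suffix=".renamed"):
    """Apply simplify_header to every file in `directory` matching `pattern`.

    Each result is written to '<original name><out_suffix>' so the input
    files are left untouched. Returns the number of headers rewritten.
    """
    changed = 0
    for path in sorted(Path(directory).glob(pattern)):
        # newline="" keeps whatever line endings the files already use
        with open(path, newline="") as handle:
            lines = handle.readlines()
        new_lines = [simplify_header(line, name) for line in lines]
        changed += sum(1 for old, new in zip(lines, new_lines) if old != new)
        out_path = path.with_name(path.name + out_suffix)
        with open(out_path, "w", newline="") as handle:
            handle.writelines(new_lines)
    return changed
```

With the directory name given in the thread, the call would be rewrite_fasta_dir("header", "Ectop_sp"). To handle the other species with long headers, call it once per species name, and check a few of the .renamed files before replacing the originals.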