5.7 years ago
imda
Hi everyone! I want to remove part of my FASTA headers. Could somebody help me, please?
I have this:
>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD
and I want just this part:
>CA01g24260
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD
or
>CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD
In the same FASTA file, I have other sequences that are not in the same format as the sequence above:
>Capsicum_annuum_glabriusculum_Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL
But in general, I just want the last part:
>Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL
This is because the program that I am using adds the name of the species to the ID.
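For the two header styles shown above, the wanted ID is simply the text after the last underscore in the first word of the header. Below is a minimal Python sketch of that rule. The function names `shorten_header` and `shorten_fasta` are made up for illustration, and the rule is only an assumption that holds for these two examples; IDs that themselves contain an underscore would be cut too short.

```python
def shorten_header(line: str, keep_description: bool = False) -> str:
    """Reduce '>Species_name_ID description' to '>ID'.

    Assumption: the wanted ID is whatever follows the last underscore
    in the first whitespace-delimited word of the header.
    """
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    first, _, description = line[1:].rstrip("\n").partition(" ")
    gene_id = first.rsplit("_", 1)[-1]  # text after the last underscore
    if keep_description and description:
        return f">{gene_id} {description}"
    return f">{gene_id}"


def shorten_fasta(in_path: str, out_path: str, keep_description: bool = False) -> None:
    """Copy a FASTA file, rewriting only the header lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(shorten_header(line, keep_description).rstrip("\n") + "\n")
```

Calling `shorten_fasta("proteins.fasta", "proteins.short.fasta")` (both file names are placeholders) would write a copy with the shortened headers; `keep_description=True` keeps the text after the ID, as in the second wanted form.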
I appreciate your answers. Your scripts worked well for some kinds of sequences but not for all. The problem is that the headers of my sequences are not uniform: I have thirteen kinds of sequences (from different species, hence different headers). I want to extract the headers so that I can get the CDS from another FASTA file and carry out selection analysis. Therefore, I need the headers to match the headers of my CDS FASTA file. For some reason, a previous analysis added the name of the species to the original sequence headers.
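Since the unwanted part is the species name that was added in front of the original ID, one option that does not rely on a single pattern is to list the species prefixes explicitly and strip whichever one matches. This is only a sketch: `SPECIES_PREFIXES` and `strip_species_prefix` are hypothetical names, and only the two prefixes visible in the examples above are filled in; the other species would have to be added by hand.

```python
# Hypothetical prefix list. Only these two entries are taken from the
# examples in this thread; the remaining species need to be added.
SPECIES_PREFIXES = [
    "Capsicum_annuum_glabriusculum_",
    "Capsicum_annuum_cvCM334_",
]


def strip_species_prefix(header: str, prefixes=SPECIES_PREFIXES) -> str:
    """Return the ID of a FASTA header with a known species prefix removed."""
    parts = header.lstrip(">").split()
    if not parts:
        return ""  # empty header line
    seq_id = parts[0]  # drop any description after the first whitespace
    # Try longer prefixes first so that 'A_b_c_' wins over 'A_b_'.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if seq_id.startswith(prefix):
            return seq_id[len(prefix):]
    return seq_id  # unknown prefix: leave the ID untouched
```

IDs with an unknown prefix are returned unchanged, so they are easy to spot afterwards and the list can be extended until all thirteen species are covered.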
These are the thirteen different kinds of sequences that I have, with the header that I need pointed out for each:
[Thirteen examples, each labelled "input" and "Required"; the header examples themselves are missing here.]
Thanks for providing the detailed examples. This is not a trivial task, because there is no clean pattern for the names you would like to keep.
So you have a file with the "original sequence headers"? What do they look like there?
Hi! I have a .fasta file with proteins from every species. They look like this:
However, the program that I used to detect orthologues can also give me all the protein sequences that belong to each orthogroup or gene family. I want to carry out some analyses with HyPhy, but this program requires CDS sequences to work, so I also have all the CDS for each species. I need to use the headers of all the sequences that belong to each gene family (from OrthoFinder) to obtain the corresponding CDS.
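Once the protein IDs match the CDS headers, pulling out the CDS for one gene family could look roughly like the Python sketch below. The names `read_fasta` and `extract_cds` are illustrative, and the sketch assumes that the ID is the first word of each CDS header and that it is identical to the cleaned protein ID.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_cds(wanted_ids, cds_path, out_path):
    """Write the CDS records whose ID is in wanted_ids to out_path.

    Assumption: the ID is the first word of the CDS header.
    Returns the set of wanted IDs that were not found in the CDS file.
    """
    wanted = set(wanted_ids)
    found = set()
    with open(out_path, "w") as out:
        for header, seq in read_fasta(cds_path):
            words = header.split()
            seq_id = words[0] if words else ""
            if seq_id in wanted:
                out.write(f">{seq_id}\n{seq}\n")
                found.add(seq_id)
    return wanted - found
```

The returned set of missing IDs gives a quick check of which headers still do not match the CDS file.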
You would need seqkit to linearize your FASTA file.
input:
output:
Dear cpad0112, could you help me with the questions that I pointed out below? Thank you.