Aloha,
I am having trouble figuring out how to remove the last '_' and everything after it in the sequence headers of a FASTA file.
I would like the following series of headers
>ART01B_100_M7_ID100005_1
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325_189
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005_46
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
to look like this:
>ART01B_100_M7_ID100005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
Because the headers contain different numbers of '_', a command that keeps a fixed number of fields, like this one, won't work:
cat Alt_MACSE_Output.fasta | awk -F _ '/^>/ { print $1"_"$2"_"$3"_"$4 } /^[A-Z]/ {print $1}' > Alt.fasta
Can anyone help me please?
Hello and welcome to Biostars, timmers.
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. Thank you!
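One possible approach, sketched here rather than tested on your real data: since the number of '_' varies, it may be easier to anchor on the end of the header line than to count fields. The sed expression below only touches lines starting with '>' and deletes the final '_' together with whatever follows it. The function name strip_last_field is just a hypothetical wrapper for illustration, and the sketch assumes the part you want to drop never contains a '_' itself.

```shell
# Hypothetical helper: remove the last "_<suffix>" from FASTA header lines.
# /^>/        restricts the substitution to header lines
# _[^_]*$     matches the final underscore and everything after it
strip_last_field() {
    sed '/^>/ s/_[^_]*$//' "$@"
}

# Small demonstration on one record piped in from printf
printf '>KIN05B_100_M7_ALT_ID230005_46\nTAAGAGGAGG\n' | strip_last_field
# prints:
# >KIN05B_100_M7_ALT_ID230005
# TAAGAGGAGG
```

For the file in the question this would be run as strip_last_field Alt_MACSE_Output.fasta > Alt.fasta. Headers without any '_' are left unchanged, and sequence lines are never modified.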