Hi,
I have a FASTA file with sequences like the following. The sequences come in pairs with similar headers. I want to generate a file containing the sequences whose header has no "shuffled" in it. How can I do that in bash?
>AABR03119176.1/72910-72785
UCCCCCAGAGUCUGGGCUUGGUGCUUUGCAGUGCUGGCGACCUAUUCCCUUUGACGAUCCCUAGGUGGAGAUGGGGCAUGAGGAUCCUCCAGGGGAAUAGCUCACCGCCACUGGGCAACAGGCCUA
>AABR03119176.1/72910-72785-shuffled
CCGCUAGCGUGAUUGGGGACGGGAUCGACCGGUGGCCCGCCGACGCCUCACCUCAUACUCGUAUGUGAUGCCGAGGGCUAGGUAAGAUGGUUGAACGCUCUAGAGUGCCCUCUGAACUUAGCCUCU
>AANN01820944.1/1549-1423
UUUCCCUCAGAAUAGGCUUGUUGCUUUACAGUACUGGUGAUCCAUUCUCUUUGAUGAUCCCcUAGGUGGAGAUGGGGCAUGAGGAUCCUCCAAGGGAAAGACUCAUCAUCACUGGGCAACAGCCUUA
>AANN01820944.1/1549-1423-shuffled
AGGCUCUGACAUAGACUCUUCUUUAGUGGGCGCGCCGACACAUACCUGUcUGAGGAGAUCGAAAUGUGUAGUCCGACAGAACUAAACAAGACUCGUCGGUGCUUAGACUUCUUUCCUGUUUGCGAUU
Try one of these:
$ sed '/^>/ s/-shuffled$//' test.fa
or
$ awk -F "-shuffled" '{print $1}' test.fa
or
$ awk -v RS=">" -v OFS="\n" 'NR>1 {sub("-shuffled$","",$1); print ">"$1,$2}' test.fa
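But note that these commands only strip the "-shuffled" suffix from the headers, so you will end up with sequences that have identical headers. That could be a problem elsewhere, for example in downstream tools. The last command also assumes each sequence sits on a single line.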
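If the goal is instead to drop the shuffled records entirely and keep only the original sequences, something along these lines might work. This is only a sketch: `sample.fa` and `unshuffled.fa` are made-up file names, the sample records are invented for illustration, and it assumes every shuffled header ends in exactly `-shuffled`.

```shell
# Hypothetical sample input; replace sample.fa with your own file (e.g. test.fa).
cat > sample.fa <<'EOF'
>seq1/1-10
ACGUACGUAC
>seq1/1-10-shuffled
CAUGCAUGCA
>seq2/5-14
UUGGCCAAUU
>seq2/5-14-shuffled
GCAUUGCAUG
EOF

# On each header line, set keep to 1 unless the header ends in "-shuffled".
# Every line (header or sequence) is printed only while keep is 1,
# so sequences wrapped over several lines are handled as well.
awk '/^>/ {keep = ($0 !~ /-shuffled$/)} keep' sample.fa > unshuffled.fa
```

Because nothing is renamed here, the headers in the output stay unique.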