I have a fasta sequence in a file called "seq.fasta", which I want to search into multiple multi-fasta files located in a directory "/home/fasta_sequences" and print all the matching sequences to a file "output.fasta".
seq.fasta file looks like:
>KNH123.1_gallus_gallus
NDEAHPWKPLRPGDIRGPCPGLNTLASHGYLPRNGVATPVQIINAVQEGLNFDNQAAVFA
TYAAHLVDGNLITDLLSIGRKTRLTGPDPPPPASVGGLNEHGTFEGDASMTRGDAFFGNN
HDFNETLFEQLVD
/fasta_sequences/ directory looks like:
Homo_sapiens.AG_v1.pep.all.fa
Aspergillus_aculeatus_atcc_16872.Aspac1.pep.all.fa
.
.
* Homo_sapiens.AG_v1.pep.all.fa* file looks like:
>OJJ94681_Homo_sapiens
MKITPIPVGVVGENWANIHDPVERRRAQNRIAQRGYRRRLRKGEERNTAQASAATLPSPP
QQQHDQVLPPPPLSLPPPLQSYSLHPPTPPTLPSDVSPYTSPADPLMFTGEPWDPPLSLD
VLFPDPVAYSAYPSPEPTSSGSPTLTSTTEGRSGPAHGHTALPLALPLPDLLHEPTPGLT
HRAALHGHEWTPLHLAARQGHTRVVQLLLTSDVHHRGAVNSVLAVLLRAGGDPSLPDSAG
RTTRRGETAFHLAVQARQHAVVRVLLEHPASYYSVNAQDCWGRTPLHLACESNQQELVEM
LILAGAQLDIRDFEDQTPLHSACLGG
>OJJ94682_Homo_sapiens
AQERKARLYAVFGGQGNDEHYFEELRTLWQTYRPLVEELIIPASSLLQRESMGGGSSRFY
HNGLDALSWLEHPERTPASAYLIGAPISMPLIGLLQLAWYRVVGRVAGVTPAQLQTALAG
TTGHSQGILPATVIAVAQSWTQFDALALDALRVLLAIGACSQDQFAVAQLPPAIVANAEA
HGEGFPGPMLSVADFPVAQVRSYVADVNAHLPEDARIEIALINGPHNIVLAGPPLSLHGF
NVWLRAVKAPASGVQHRIPFSQRRPEIRNRFLPITAAFHSSYLAEVVPEVLARVPGLTIS
GEQLRIPVYCTRTGLDLRTRGAANLVPELVAMITEQPLVWPKATQFPGATHILAFGPEGS
AGIGSLTNRLKEGTGVRTILASGGHHSHSARAEQLGDRS
.
.
I know how to search for a string in multiple files
#!/usr/bin/perl
{
local @ARGV = glob "/home/fasta_sequences/*.fa";
while (<>) {
print "$ARGV:$.:$_" if /TTHACVGSYYLSEDQKKLMNIFFTYRP/;
} continue {
close ARGV if eof;
}
}
But I am unable to find a way to do the same with a complete fasta sequence.
Any help would be appreciated!
Hello MB,
you haven't just have the problem to find the sequence in another file. You also have to find out in which sequence it is contained if you have multiple sequences in one file.
So my question is if you realy want/need to use perl or if something like standard unix tools or e.g. seqkit is also suitable?
fin swimmer
Thanks for your answer. I will try to use grep command from seqkit.