Hi,
I have a reference FASTA protein database (~1M lines) which contains a mix of UniProt amino acid sequences and some of my own protein sequences. It looks something like this:
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISS
>my_peptide_43624534
GNTSKTDEQFIHQECIAKSSLWKYTKITKSNVTSYQILWSCSASIDFCFIFYLNLLAGRFALLNTLTATRLLLCW
I also have a list (~1k lines) of unannotated amino acid sequences, which looks like this:
-unknown_pep1 ECIAKSSLWKY
-unknown_pep2 SNVTSYQILWSCS
I am trying to search the unknown amino acid sequences against the reference FASTA file and annotate each unknown peptide with either a UniProt name or a "my_peptide" name. I am a Python user, and I tried loading the reference file into a pandas DataFrame and then using str.contains() to locate each peptide in the FASTA, but loading the file into pandas takes forever because it is just too big. I am thinking about reading the FASTA line by line with readline() instead, but that would still be 1M * 1k comparisons. Does anyone have a good idea for how to solve this problem quickly?
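To make the question concrete, here is a minimal sketch of the line-by-line idea without pandas. It is only an illustration under a few assumptions: the function names (read_fasta, read_peptides, annotate) are made up for this post, the leading "-" in the peptide list is treated as decoration and stripped, a match means the peptide is an exact substring of a reference sequence, and the annotation is taken to be the first whitespace-separated token of the FASTA header.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_peptides(lines: Iterable[str]) -> Dict[str, str]:
    """Parse lines like '-unknown_pep1 ECIAKSSLWKY' into {name: sequence}.

    Assumption: the leading '-' is not part of the peptide name.
    """
    peptides = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            peptides[parts[0].lstrip("-")] = parts[1]
    return peptides


def annotate(fasta_lines: Iterable[str],
             peptide_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each peptide name to the IDs of reference entries containing it."""
    peptides = read_peptides(peptide_lines)
    hits: Dict[str, List[str]] = {name: [] for name in peptides}
    # Single pass over the reference; only the ~1k peptides are held in memory.
    for header, seq in read_fasta(fasta_lines):
        entry_id = header.split()[0]  # e.g. 'sp|P62258|1433E_HUMAN'
        for name, pep in peptides.items():
            if pep in seq:  # exact substring match
                hits[name].append(entry_id)
    return hits
```

With two open file handles (the file names here are placeholders), this would be called as annotate(open("reference.fasta"), open("unknown_peptides.txt")). It streams the reference once instead of loading it, but it is still the brute-force number of comparisons, which is the part I would like to avoid.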
Thanks!
Robin