Hi all,
Sorry for my lack of knowledge. I'm new to Biopython, and looking through the manual, there's so much information that I don't know where to begin. Basically, my FASTA files look like this:
>Ribosomal_L16___Rinke_et_al___edaecdd15ca8bf2bacd09a738e78218887c932d5f415ce2dd05b8fa9|bin_id:Acidianus_Hospitalis_W1.fixed|source:Rinke_et_al|e_value:3.2e-29|contig:c_000000000001|gene_callers_id:1239|start:1112992|stop:1113442|length:450
ATGCCAAAA…
I have a list of protein names, and for each name I want to find the matching record and print its sequence. That is to say, if I were to look up "Ribosomal_L16", it would print out "ATGCCAAAA…".
This seems simple enough, except that most of the title outside of the protein name (i.e. "Ribosomal_L16") is completely variable. Does anyone have a good place for me to start thinking about how to solve this issue?
Thanks so much.
You should be more informative with regard to your FASTA input file.
It's unclear if the protein name (Ribosomal_L16 etc.) is:
- always at the start of the header (python: string.startswith())
- always followed by ___ (python: string.split('___')[0])
I've recently learned how to use Biopython, and Python for that matter. Sometimes I found myself overthinking the problem I had, treating it as a biology problem when there was a much easier way to approach it with string matching. If the '___' is conserved, you could split the name at that point and search for that part of the name (it might be sequence.name, I don't remember).
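The string-matching idea above could be sketched roughly as follows. This is only a minimal sketch, assuming the protein name is always the text before the first '___' in the header. It uses plain Python rather than Biopython so it runs without extra installs; with Biopython, `SeqIO.parse(handle, "fasta")` yields records whose `id` and `seq` could be used in place of the small parser here. The function names `read_fasta` and `sequences_by_protein` and the toy headers are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequences_by_protein(lines, wanted, sep="___"):
    """Map each wanted protein name to the sequences filed under it.

    Assumption: the protein name is everything before the first `sep`.
    """
    wanted = set(wanted)
    hits = {}
    for header, seq in read_fasta(lines):
        name = header.split(sep)[0]
        if name in wanted:
            hits.setdefault(name, []).append(seq)
    return hits


# Toy input with shortened, made-up headers and sequences.
example = """\
>Ribosomal_L16___Rinke_et_al___abc|bin_id:x
ATGCCAAAA
>Ribosomal_L2___Rinke_et_al___def|bin_id:y
ATGGGT
TTTAAA
""".splitlines()

hits = sequences_by_protein(example, ["Ribosomal_L16"])
for name, seqs in hits.items():
    for seq in seqs:
        print(name, seq)
```

For a real file, passing an open file handle instead of `example` should work the same way, since the parser only iterates over lines. A name can map to several sequences because the same protein may appear in many bins.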
Have you written a snippet of Python code you can show?