Question

Question about searching for/manipulating FASTA sequences in Biopython

0

Entering edit mode

7.7 years ago

jon.sy.tarn ▴ 10

Hi all;

sorry for my lack of knowledge. I'm new to biopython and looking through the manual, there's so much information I don't know where to begin. Basically, my FASTA files look like this:

>Ribosomal_L16___Rinke_et_al___edaecdd15ca8bf2bacd09a738e78218887c932d5f415ce2dd05b8fa9|bin_id:Acidianus_Hospitalis_W1.fixed|source:Rinke_et_al|e_value:3.2e-29|contig:c_000000000001|gene_callers_id:1239|start:1112992|stop:1113442|length:450
ATGCCAAAA…

I'm looking to take a list of protein names and I'm trying to search for, find, and print out the resulting sequence from said name. That is to say, if I were to look up "Ribosomal_L16", it would print out "ATGCCAAAA…".

This seems simple enough, except for that most of the title outside of the protein name (i.e. "Ribosomal_L16") is completely variable. Does anyone have a good place for me to start thinking about how to solve this issue?

thanks so much.

sequence biopython • 1.7k views

ADD COMMENT • link updated 7.7 years ago by WouterDeCoster 47k • written 7.7 years ago by jon.sy.tarn ▴ 10

1

Entering edit mode

You should be more informative with regard to your fasta input file.
It's unclear if the protein name (Ribosomal_L16 etc) is:

Always first in the fasta identifier (python: string.startswith())
Always separated by ___ (python: string.split('___')[0])
Present once, multiple times or potential absent from your fasta file
...

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

0

Entering edit mode

I've recently learned how to use biopython, and python for that matter. Sometimes I found myself overthinking the problem I had, and thought it was a biology problem. There is a much easier way to approach this by string matching. If the '__' is conserved, you could split the name at that point, and search for that part of the name (it might be sequence.name, i don't remember).

protein_name = str ( sequence.id).split('__')[0]

ADD REPLY • link updated 7.7 years ago by WouterDeCoster 47k • written 7.7 years ago by pawlowac ▴ 80

0

Entering edit mode

I'm new to biopython and looking through the manual Does anyone have a good place for me to start thinking about how to solve this issue?

Have you written a snippet of python code you can show?

ADD REPLY • link 7.7 years ago by st.ph.n ★ 2.7k

score 0 · Answer 1 · 2017-03-21

0

Entering edit mode

7.7 years ago

Ram 44k

I'd look at Heng Li's bioawk - it combines awk with built in parsers for biological formats, so you can use regular expressions to sift through FASTA components such as the title.

ADD COMMENT • link 7.7 years ago by Ram 44k

score 0 · Answer 2 · 2017-03-21

0

Entering edit mode

7.7 years ago

shenwei356 8.7k

try seqkit:

seqkit grep -n -r  -i -f protein_names.txt seqs.fasta | seqkit seq -s -w 0

seqkit grep -n -r -i -f protein_names.txt means searching FASTA records (grep) of which the full header (-n/--by-name) match any of regular expression (-r/--use-regexp) from files (-f/--pattern-file), cases ignored (-i/--ignore-case)
seqkit seq -s -w 0 means only print sequences (-s/--seq), and do not wrap sequences to fixed length (-w/--line-width)

ADD COMMENT • link 7.7 years ago by shenwei356 8.7k

0

Entering edit mode

I'd recommend not giving out close-to-exact commands - users are better off learning tools on their own. Especially here, when the OP says

Does anyone have a good place for me to start thinking about how to solve this issue?

giving a command is unadvisable.

ADD REPLY • link 7.7 years ago by Ram 44k

0

Entering edit mode

Given this, advices on the specific programming language, library, and programming pattern are the best appropriate solution, so is . :P