Question

How to deal with spaces in sequence names with Biopython?

10.2 years ago

grayapply2009 ▴ 300

I have a fasta file formatted as follows:

>UPF0471 protein C1orf63 homolog

some sequence

>WD repeat-containing protein 43

some sequence

>transmembrane protein 41A

some sequence

When I print record.id or build dictionaries, Biopython keeps only the first word of each sequence name and drops everything after the first space. What should I do to make Biopython treat the name as a whole rather than just taking its first word?

sequence space name biopython • 3.7k views

ADD COMMENT • link updated 10.2 years ago by Peter 6.0k • written 10.2 years ago by grayapply2009 ▴ 300

Replace the spaces with "_" or "-"?

ADD REPLY • link 10.2 years ago by pld 5.1k

You'll find that most tools treat spaces in FASTA identifiers the same way, so that's a good idea!

ADD REPLY • link 10.2 years ago by Peter 6.0k
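
The underscore suggestion in this thread can be sketched in plain Python, without Biopython. This is only an illustration: `underscore_headers` is a hypothetical helper name, and the short sequence in the usage line is made up. It rewrites header lines only, so the sequence lines pass through unchanged.

```python
def underscore_headers(fasta_text: str, replacement: str = "_") -> str:
    """Return FASTA text with spaces in header lines replaced.

    Hypothetical helper for illustration; only lines starting
    with ">" are modified.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Strip surrounding whitespace from the title, then swap
            # the remaining spaces for the replacement character.
            line = ">" + line[1:].strip().replace(" ", replacement)
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up sequence; the header comes from the question.
print(underscore_headers(">WD repeat-containing protein 43\nMSDNELLKQA\n"))
```

After this rewrite the whole header is a single whitespace-free token, so record.id should carry the full name. The cost is that the original spacing is lost unless you map the underscores back later.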

score 2 · Answer 1 · 2015-05-14

10.2 years ago

Damian Kao 16k

You can get the whole header by using record.description. For FASTA input, record.id holds only the part of the header before the first whitespace.
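
A minimal sketch of the difference between the two attributes, assuming Biopython is installed. The headers are taken from the question; the sequences are made up, and the data is parsed from an in-memory string rather than a file.

```python
from io import StringIO

from Bio import SeqIO

# Headers from the question; the sequences are made-up placeholders.
fasta_text = """>UPF0471 protein C1orf63 homolog
MKTAYIAKQR
>WD repeat-containing protein 43
MSDNELLKQA
"""

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    # record.id: header text up to the first whitespace.
    # record.description: the full header line without the leading ">".
    print(repr(record.id), "|", repr(record.description))
```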

score 1 · Answer 2 · 2015-05-14

10.2 years ago

Peter 6.0k

To answer your second question (how to build a dictionary with SeqIO.to_dict that uses the full descriptions, spaces included, as keys): pass the key_function argument, as help(SeqIO.to_dict) tries to explain, e.g.

from Bio import SeqIO
my_dict = SeqIO.to_dict(sequences, key_function=lambda rec: rec.description)
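
A self-contained version of the same idea, as a sketch that assumes Biopython is installed. The headers come from the question, the sequences are made up, and `by_description` is just an illustrative variable name.

```python
from io import StringIO

from Bio import SeqIO

# Headers from the question; the sequences are made-up placeholders.
fasta_text = """>UPF0471 protein C1orf63 homolog
MKTAYIAKQR
>WD repeat-containing protein 43
MSDNELLKQA
>transmembrane protein 41A
MGSLLRLLAA
"""

records = SeqIO.parse(StringIO(fasta_text), "fasta")

# Key each record by its full header instead of the default record.id.
by_description = SeqIO.to_dict(records, key_function=lambda rec: rec.description)

print(sorted(by_description))
print(by_description["WD repeat-containing protein 43"].seq)
```

Note that SeqIO.to_dict raises a ValueError on duplicate keys, so this only works if every full header in the file is unique.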

ADD COMMENT • link 10.2 years ago by Peter 6.0k

score 0 · Answer 3 · 2015-05-14

10.2 years ago

grayapply2009 ▴ 300

Then how do I make dictionaries with SeqIO.to_dict?

ADD COMMENT • link 10.2 years ago by grayapply2009 ▴ 300

This isn't an answer. Is it a new question, or an addendum to your original question?

ADD REPLY • link 10.2 years ago by Peter 6.0k