Question

Comprehension To Extract Fasta Sequences From Dataset

11.3 years ago

s.charonis ▴ 100

Dear BioStar Community,

I am analyzing a dataset of bacterial proteins structured as follows:

>PDB1a0na_unknown
PPRPLPVAPGSSKT
>PDB1a1ta_ENZ
MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN

etc ....

This dataset has about 15000 entries. What I am trying to do is extract all proteins according to their annotations (ENZ, MEM, unknown, etc.) to perform sequence/structure analysis on them. I am using a list comprehension to do this as follows (ENZ annotation shown here):

def retrieve_enzyme_proteins(filename):
    with open(filename) as file:
        for line in file:
           # search for the proteins annotated as enzymes (ENZ)
            if line[0]='>' and '_ENZ' in line:
                # I need a line here telling the function to jump to the next line and extract the sequence
                return [line.strip('\n') for line in file if line[0] != '>']

So far I have only been able to extract EVERY protein sequence. I would like a logical structure that tells my program to look for the desired annotation (e.g. ENZ), then jump to the next line, where its sequence begins, and extract that sequence. Apologies if the indentation is messed up, the copy-paste does that. I am using Python 2.7.3. Any help would be much appreciated!

python fasta • 5.4k views

ADD COMMENT • link updated 11.1 years ago by mayingfei ▴ 10 • written 11.3 years ago by s.charonis ▴ 100

score 2 · Answer 1 · 2013-09-10

Sorry, I got sidetracked by the new iPhone announcements. Extending the code you provided leads me to this:

def retrieve_enzyme_proteins(filename, filt):
    """Open a FASTA file, read it line by line and collect every record whose
    annotation (the last '_'-separated element of the header) is in `filt`.

    filt: a list of annotations we are interested in e.g. ['ENZ', 'MEM']
    Returns a list of (header, sequence) tuples."""
    records = []
    header = None
    annotation = None
    sequence = ''
    with open(filename, 'r') as fasta:
        for line in fasta:
            if line.startswith('>'):  ## identify header
                if annotation in filt:
                    ## store the record that this new header brings to an end
                    records.append((header, sequence))
                header = line.strip()
                annotation = header.split('_')[-1]  ## set annotation as last element of header
                sequence = ''
            else:
                sequence += line.strip()
    if annotation in filt:
        records.append((header, sequence))  ## store the final record
    return records

And using your example file, here is how you would use the function:

>>> retrieve_enzyme_proteins('80929.fa', ['ENZ'])
[('>PDB1a1ta_ENZ', 'MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN')]

A few things to mention:

Please try to avoid variable names such as file that shadow Python built-ins (in Python 2, file is a built-in type).
Parentheses around each comparison, as in if (x > 10) and (y <= 5):, are not required, because and binds more loosely than the comparison operators, but they can make a compound condition easier to read.
Note that when making comparisons such as x == 5 you use two =, and when assigning a value to an object such as x = 5 you only use one.
It's usually best to think in terms of your program reading data line by line, acting on that line, then moving to the next line. There is no such easy thing as reading a line and then 'jumping forward' to do something, while implying that we are still 'on' the old line. It's much better to store lines as objects that we then use later on during the processing of a different line.
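
The last point, storing what earlier lines told us and acting on it when a later line arrives, can be sketched as a small generator. This is only an illustrative sketch, not part of the answer above: the names `iter_fasta` and `select_by_annotation` are made up here, and it assumes the annotation is whatever follows the last underscore in the header, as in the example data.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper: the current header and the sequence pieces seen so
    far are stored, and a record is only emitted once the next header (or the
    end of the input) shows that it is complete.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header = line
            chunks = []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        yield header, ''.join(chunks)


def select_by_annotation(lines, wanted):
    """Return the records whose header ends in '_<annotation>' with the annotation in `wanted`."""
    return [(h, s) for h, s in iter_fasta(lines)
            if h.split('_')[-1] in wanted]


example = [
    '>PDB1a0na_unknown\n',
    'PPRPLPVAPGSSKT\n',
    '>PDB1a1ta_ENZ\n',
    'MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN\n',
]
print(select_by_annotation(example, ['ENZ']))
```

Because an open file is itself an iterable of lines, passing `open(filename)` instead of the `example` list should work the same way on a real file.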

score 1 · Answer 2 · 2013-09-10

11.3 years ago

pld 5.1k

Here is a compact function to pull the sequences out of a multiple-entry FASTA file for those entries containing header in the FASTA definition line:

def fasta(indir, header):
    return [reduce(lambda x,y: x+y, x.split('\n')[1:]) 
         for x in filter(lambda x: header in x, open(indir, 'rb').read().split('>')[1:])]

This can be expanded into a few steps if you want:

# (written for Python 2; in Python 3 use open(indir).read() and
#  from functools import reduce)
# read the raw file
raw = open(indir, 'rb').read()

# then I want to generate a list with each FASTA entry as an element
# e.g. ["EMZ\nAAAA\n", "PDB\nGGGG"] (splitting on ">" removes the ">" itself)
splitfi = raw.split(">")[1:]

# then I want to collect only those that have the right header
header = "EMZ"
cleaned = filter(lambda x: header in x, splitfi)

# then I concatenate the sequence lines and drop the headers
final = [reduce(lambda x, y: x + y, x.split("\n")[1:]) for x in cleaned]

Note that on Windows, lines end with \r\n rather than \n, so you'll have to adjust the code to account for that difference.

Say your file contents are

>EMZ
AAAAAAAA
>PDB
GGGGGGG

Calling the fasta function above would look like:

>>> fasta(myfile, 'EMZ')
['AAAAAAAA']
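
On the Windows newline point: one way to sidestep the LF versus CRLF difference is to split with `str.splitlines()`, which treats both as line endings. The following is a rough sketch rather than a drop-in replacement for the function above. `fasta_from_text` is a name invented for this example, it works on text that has already been read into a string, and unlike the compact version it looks for the tag only in the definition line, not in the sequence.

```python
def fasta_from_text(text, tag):
    """Return the sequences of all entries whose definition line contains `tag`.

    `text` is the whole FASTA file as one string; splitlines() copes with
    both LF and CRLF line endings.
    """
    sequences = []
    for entry in text.split('>')[1:]:   # drop whatever precedes the first '>'
        lines = entry.splitlines()
        if lines and tag in lines[0]:   # lines[0] is the definition line
            sequences.append(''.join(part.strip() for part in lines[1:]))
    return sequences


unix_text = ">EMZ\nAAAA\nAAAA\n>PDB\nGGGGGGG\n"
windows_text = ">EMZ\r\nAAAA\r\nAAAA\r\n>PDB\r\nGGGGGGG\r\n"
print(fasta_from_text(unix_text, 'EMZ'))
print(fasta_from_text(windows_text, 'EMZ'))
```

Both calls should give the same result, since the line-ending style no longer matters once the entry has been split into lines.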

ADD COMMENT • link 11.3 years ago by pld 5.1k

@joe Many thanks for your help, that is quite a complicated function! I used the non-function, expanded version and it worked well. Can I ask one more thing: having all sequences now stored within a list, each being its own list, as [['header', 'sequence'], ...], is there a way to write these sequences to a file so that I can run a script on them? I've tried using Python's write function, but it raises a TypeError: expected a character buffer object when doing open()

ADD REPLY • link 11.3 years ago by s.charonis ▴ 100

Without seeing your code, and assuming your sequences are stored as [[h1,s1],[h2,s2]]:

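def writeFasta(name, seqs):
    with open(name, "w") as fi:
        # Python 2: map() runs immediately. In Python 3 map() is lazy, so nothing
        # would be written unless the result is consumed, e.g. with list(map(...)).
        map(lambda x: fi.write(x[0] + "\n" + x[1] + "\n"), seqs)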

I am assuming you are just calling write on your list. You have to iterate through the list instead.
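
To make the cause of the TypeError concrete: `write` accepts a single string, so handing it a list fails, and each record has to be turned into text first. Below is a small sketch written for Python 3, assuming the `[[header, sequence], ...]` layout described above. It writes to an in-memory `io.StringIO` buffer purely so the example does not touch the disk, and `format_records` is a made-up name.

```python
import io


def format_records(seqs):
    """Turn [[header, sequence], ...] into one FASTA-formatted string."""
    return ''.join('{0}\n{1}\n'.format(header, sequence)
                   for header, sequence in seqs)


seqs = [['>PDB1a0na_unknown', 'PPRPLPVAPGSSKT'], ['>EMZ', 'AAAAAAAA']]

buf = io.StringIO()
try:
    buf.write(seqs)              # a list is not a string, so this fails
except TypeError as err:
    print('write() rejected the list:', err)

buf.write(format_records(seqs))  # one string, so this is accepted
print(buf.getvalue())
```

The same `format_records(seqs)` string could be passed to the `write` method of a file opened with `open(name, "w")`.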

ADD REPLY • link 11.3 years ago by pld 5.1k

+1 because your solution should work, but it might be best to avoid both anonymous (lambda) functions and map/reduce in this answer:

def writeFasta(name, seqs):
    with open(name, "w") as fi:
        for pair in seqs:
            header, sequence = pair
            fi.write('{0}\n{1}\n'.format(header, sequence))

ADD REPLY • link 11.3 years ago by Matt Shirley 10k

I would argue that there isn't anything wrong with using map/reduce and anonymous functions, especially when you're just using it to do some simple currying. You can take it too far, but I've never understood the argument for mapping a lambda being harder to read than a loop.

ADD REPLY • link 11.3 years ago by pld 5.1k

I don't think there is anything wrong with using them, but for someone who is beginning to learn programming paradigms I do believe that a loop is universally easier to understand than map/reduce on a list, because as you said above, eventually "you have to iterate through the list", and iteration is where most people start.

ADD REPLY • link 11.3 years ago by Matt Shirley 10k

@Joe: Many thanks for your help, that works fine. As you said, I was just calling write on my list and did not iterate through it, which is why a TypeError was raised. That's fixed now!

ADD REPLY • link 11.3 years ago by s.charonis ▴ 100

score 0 · Answer 3 · 2013-09-10

I don't know much about Python, but if your text file is properly formatted fasta then this can be solved with grep.

Given this text file saved as temp.fasta

>PDB1a0na_unknown
PPRPLPVAPGSSKT
>PDB1a1ta_ENZ
MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN
>ASDFG
ABCDEFGS
>ONETWOTHREE
ANOTHERRANDOMSEQUENCE

I run this command:

grep -A 1 "_ENZ" temp.fasta
>PDB1a1ta_ENZ
MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN

Then just strip out the lines with > by piping to another grep.

grep -A 1 "_ENZ" temp.fasta | grep -v '^>'
MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN

The key here is the "-A 1" which returns the next line AFTER a search match for "_ENZ".

EDIT: What if your protein sequences span more than one line? If you need to pull out everything between the record identifier lines ( > symbol indicates new record ) then the code gets a little more tricky. A programming expert would not consider this tricky, but we're just trying to get the work done as quickly and correctly as possible, so I do not recommend trying to write any new code.

There is a tool called faSomeRecords, provided by UCSC as a precompiled binary for Linux and Mac, to which you can give a list of record names and it will extract them all. Get your record names with

grep "_ENZ" temp.fasta | sed 's/^>//' > records.txt

Then extract those sequences with

faSomeRecords temp.fasta records.txt output.txt

Now if you need this to run under Windows, we're going to have to find another similar solution.

I found this answer over at http://seqanswers.com/forums/showthread.php?t=9498, where there is much discussion of Perl-based solutions that should work under any operating system.
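
For readers who cannot run the UCSC tool, for example on Windows, the same idea (keep every line of a record whose name is on a list, however many lines the sequence spans) can be sketched in Python. This is not how faSomeRecords is implemented. `extract_records` and its arguments are hypothetical, and the sketch assumes the record name is the first whitespace-delimited word after the > symbol. The ENZ sequence is split over two lines here only to show the multi-line case.

```python
def extract_records(fasta_lines, wanted_names):
    """Return the lines of every record whose name is in `wanted_names`.

    The `keep` flag is switched at each header line and then applies to all
    following sequence lines, so multi-line sequences are handled.
    """
    wanted = set(wanted_names)
    kept = []
    keep = False
    for line in fasta_lines:
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            fields = line[1:].split()
            name = fields[0] if fields else ''
            keep = name in wanted
        if keep and line:
            kept.append(line)
    return kept


fasta_lines = [
    '>PDB1a0na_unknown',
    'PPRPLPVAPGSSKT',
    '>PDB1a1ta_ENZ',
    'MQKGNFRNQRKTVKCFNCGKEGHIAKNC',
    'RAPRKKGCWKCGKEGHQMKDCTERQAN',
    '>ASDFG',
    'ABCDEFGS',
]
for line in extract_records(fasta_lines, ['PDB1a1ta_ENZ']):
    print(line)
```

The list of wanted names could be read from a file with one name per line, and `fasta_lines` could be an open file, since only one line is examined at a time.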

score 0 · Answer 4 · 2013-11-26

Well, if you have a lot of sequences with different names that need to be extracted from the dataset, you don't have to use Python.
For example, say you have a .txt file, a.txt, containing the names of the sequences you want to extract:

11111111
22222222
33333333

Then you have a dataset like this, b.fas:

>11111111
ATCGATCTATCG
>22222222
ATCCCCCCCCC
>33333333
GGGGGGGGGG
>44444444
AAAAAAAAAAA

You can use this code to extract them:

for i in $(cat a.txt); do awk 'BEGIN{RS=">"}/'$i'/{print ">"$0 }' b.fas >> output; done