Question

Filter DNA sequence in FASTA file by minimum length and redirect sequences that don't meet condition to new file using python

0

Entering edit mode

6.0 years ago

asidhu ▴ 10

I am using a FASTA file with many sequences that I wish to filter. An example of the file is as follows:

>SRR5533383:0::12:0:0:0:
GTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGAGAAAGTTTGCGGCTCAACCGTAAAATTG$
>SRR5533383:0::19:0:0:0:
GTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTAGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG$
>SRR5533383:0::17:0:0:0:
GTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACATGTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG$

Where the line that starts with ">" is the header and the line below it is a sequence. For each sequence there is a header. I want to be able to filter the sequence based on a minimum length (150 base pairs) that they must have in order to be kept in the main file. If the sequence is < 150 then I want it to be removed from the main file and the header of the removed sequence should go to another file ("bad reads file") where the reason it was excluded is included. Such that the output of the "bad reads" file looks like:

>SRR5533383:0::17:0:0:0:
"Excluded because too short"

I'm trying to use the following code to filter the sequences but I can't get it to work:

sequences = {}

with open("seq.fasta") as file:
    header = None
    data = ''

    for line in file:
        if line.startswith(">"):
            if header and data:
                sequences[header] = data
            data = ''
            header = line.rstrip()
        else:
            data += line.rstrip()

    if header and data:
        sequences[header] = data  # deal with the last one in the file

for header, data in sequences.items():
    print("{}; {}".format(header, data))

#Creates "sequences" in header:sequences 

seq = np.asarray(list(data))
print(seq)

#Minimum threshold
min_length = 150
for line in data:
    if line >= min_length:
        print("Sequences meeting the minimum threshold are kept in paired.fasta")
    else:
        data.removeline with open("output.log", "w")
    print("%i removed because too short" % (header))

python sequencing • 5.2k views

ADD COMMENT • link updated 6.0 years ago by Heather Ward ▴ 50 • written 6.0 years ago by asidhu ▴ 10

3

Entering edit mode

6.0 years ago

rizoic ▴ 250

In case it is a requirement to do the solution without any additional packages then you can use the solution by Heather. However when parsing standard file formats it is always recommended to use specialized packages for handling them. You can use the Biopython package in python for this. An example for filtering with it is as follows

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", format = "fasta"):
    if len(seq_record) < 150:
        print(seq_record.description)

You can find more details here

ADD COMMENT • link 6.0 years ago by rizoic ▴ 250

0

Entering edit mode

Thank you for this! I was instructed not to use Biopython otherwise this would have been the perfect method!

ADD REPLY • link 6.0 years ago by asidhu ▴ 10

score 5 · Accepted Answer · 2018-12-03

5

Entering edit mode

6.0 years ago

Heather Ward ▴ 50

You've got some problems in there - for example, seq = np.asarray(list(data)) seq is only going to contain the last line of the fasta file, not every sequence. Try going back through your code line by line and think about what the data variable gets filled with during each iteration, and why that might be so.

Here is one way to do what you want - there are many. You were getting there with the first half of your program. Keep practicing!

#!/usr/bin/env python3

sequences = {}
good_reads = []
bad_reads = []

with open("seq.fasta", "r") as file:
        for line in file:
                line=line.rstrip()
                if (line[0] == ">"):
                        header = line
                        sequences[header] = ""
                else:
                        data = line
                        sequences[header] += data

min_length=150

# figure out which reads are good/bad
for header in sequences.keys():
        if (len(sequences[header]) > min_length):
                good_reads.append(header)
        else:
                bad_reads.append(header)

# write good reads
with open("good.fasta", "w+") as good_out:
        for header in good_reads:
                good_out.write("{}\n{}\n".format(header, sequences[header]))

# write bad reads
with open("bad.fasta", "w+") as bad_out:
        for header in bad_reads:
                bad_out.write("{}\nExcluded because too short\n".format(header))

ADD COMMENT • link 6.0 years ago by Heather Ward ▴ 50

0

Entering edit mode

Thank you so much! This is exactly what I was looking for. I'm still new to python and practicing :)

Just a question, if I wanted to edit the seq.fasta file such that only the good reads remained in it, is that possible? In your solution there is a new good_reads.fasta which works perfectly but if I wanted to continue to use seq.fasta with just the good reads in it could I do that? Could it be possible to overwrite the seq.fasta file like this:

#write good reads
with open("seq.fasta", "w+") as good_out:
        for header in good_reads:
                good_out.write("{}\n{}\n".format(header, sequences[header]))

ADD REPLY • link 6.0 years ago by asidhu ▴ 10

0

Entering edit mode

Yes, this would work, however it's generally bad practice because doing this will destroy your input file.

It would be best to save your output as a new file in case something goes wrong in your program and your output isn't what you'd expect. For example, imagine in the codeblock you posted there that you accidentally wrote for header in bad_reads. If you ran this it will overwrite your input file with only the bad reads - this is obviously an easy fix in the python program to change it to good_reads, except that you no longer have any good reads in your seq.fasta file because you overwrote it.

At the very least, if you want the output name to be identical, you should make a new directory and output the filtered file there: with open("outputs/seq.fasta", "w+") as good_out:

ADD REPLY • link 6.0 years ago by Heather Ward ▴ 50