Fasta File Python
3
1
Entering edit mode
2.8 years ago
cr173921 ▴ 40

How do I go about extracting elements from a fasta file. For example, if I want a list of all the IDS and then length of a sequence in another list how do I do that in base python without using any libraries?

for line in fasta:
        if line.startswith('>'):
        #Extract ids
        #Extract sequence lengths
fasta python • 5.6k views
ADD COMMENT
1
Entering edit mode

For this task, tools like bioawk , seqkit are good.

ADD REPLY
0
Entering edit mode

awk is cool ... I've never heard of seqkit

ADD REPLY
0
Entering edit mode

Its not worth coding in Python without Biopython 'cause its reinventing the wheel but len('ACGATCGA') is the basis of you code

ADD REPLY
1
Entering edit mode

I think you are making an unfounded assumption that everyone on this site wants things done the easiest way possible. That is simply not the case, as many posters are asking for solutions that will allow them to learn something from the scratch. That can be a homework, or can be a task posed by prospective employers. Granted, people who want to learn something from the scratch usually put more effort forward before coming here to ask the question, so I am leaning towards the homework assignment. Either way, the question is very clear about needing base python solution.

ADD REPLY
1
Entering edit mode

i am just asking about how to approach this as i did put in effort to learn it. thanks for the help it was so helpful Mensur Dlakic

ADD REPLY
1
Entering edit mode

Its easy you open the file with an open statement with open(filenname, 'r') as f: the next line you pass the file into it with line = readlines .... You are then gonna section it up with some sort of while line statement ... in Python you need to reference every element in the resulting list... so you need a counter (its not like Perl). USe this to load a dictionary the > is stripped and used as the key whilst the sequence is the value. Once loaded you loop through the dictionary with for keys, values in dict.items(): and use the len(values) to get the sequence lengths. Its easy but Python's statement while lines is not like Perl, in Perl this statement triggers a magic function $_ for each line in the input file. Thats it.

ADD REPLY
1
Entering edit mode

Thank you this helps :) I am having trouble sectioning it up. What i did was an if else statement, if it has a >, then i get the IDS, then the else statement is to get the sequence. What do you mean exactly by counter?

ADD REPLY
1
Entering edit mode

Upvotes are always welcome and preferred. In the approach I'm using the counter is essential, because it doesn't count the file input lines for you (assuming you're using an input file - which is normal). Yes could can do that because if you use a while line: and trigger a line by line counter i +=1 inside the while loop then if line[i][0] == '>': you're immediately in business because the next line is lstrip() (remove the '>') (rstrip() is good to removing '\n') and its in the dictionary or a list (you can use zip to join the id with the sequence into a dictionary). This is useful for massive files because its quicker than re. (regex) . Don't use an unprotected else statement ever in Python unless it leads to an Exception or a continue (if everything fails) or your really confident of all scenarios, elif is better, regex helps be certain of sequence capture - big data its avoid (its slow). Also you'll need to break the while loop and the counter is needed for that ... its weird not like Perl, so you'll need break to stop the while loop. The headache of this approach is len(line) against the counter which is set against a break. There are other approaches, but this is quick. Also keep in mind try: and except are cool.

ADD REPLY
1
Entering edit mode

thank you, can you give a quick pseudocode breakdwon, im having trouble understanding it

ADD REPLY
1
Entering edit mode

assuming that id and sequences are in single line:

  1. Read all lines starting with ">"
  2. Read all lines next to lines starting with ">
  3. Print output from step 1, print length of the lines read in step 2.
  4. Once step 3 goes prints the data you want, write the data to harddisk.
ADD REPLY
0
Entering edit mode

I do agree this is much easier ... however, fasta never puts the id and sequence in the same line this is only really phylip format. ... In this scenario tyou can use re.split on a space character (but that uses re again - but split is a key part of any text processing. and the split provides key and value for a dictionary. So it is a million-fold easier than my code, but its not realistic against Genbank fasta.

ADD REPLY
2
Entering edit mode

I think I am not clear when I say IDs and sequences are single line. What I meant is this:

>seq 1
atgcgc

Not like this:

>seq 1
atgc
gc

In the first example sequence is in single line (flattened) and in the second example sequence spans multiple lines.

Given that this is what would be one of the pseudocode possible for parsing fasta:

  1. Read file as iterator
  2. Create two empty lists: ID and Length
  3. Loop over iterator and append the lines that start with ">" to ID list and append the length of the lines that do not start with ">" to Length list (assuming that sequences are in single line)
  4. Create a pandas data frame with ID and Length lists with column names as ID and Length.
  5. Write pandas data frame as excel/csv/tsv.
ADD REPLY
0
Entering edit mode

okay, thanks my misunderstanding

ADD REPLY
0
Entering edit mode

Thank you, what if the sequence lines take up multiple lines?

ADD REPLY
0
Entering edit mode

I wrote out the complete code to do this below - but you ignored it so I removed it

ADD REPLY
0
Entering edit mode

Unfortunately, I wasn't able to see it???

ADD REPLY
0
Entering edit mode

No problem. That's how the above discussion spawned. Normally you'd use collections.Defaultdict to spool sequence spanning multiple lines onto the ID key using +=. In your case it doesn't appear permitted (no libraries) and I couldn't remember how to flatten a list. The flattening function is

def flatten(mylist):
    return [item for sublist in mylist for item in sublist]

You need len(seq) and can't do that if the sequence is stuck in a list, which is what will otherwise happen if it spans multiple lines.

Goodluck with your project

ADD REPLY
0
Entering edit mode
import pandas as pd

with open("test.fa", "r") as f:
    file = f.readlines()

name = None
sequences = dict()
for line in file:
    line = line.rstrip()
    if line[0] == '>':
        name = line[1:]
        sequences[name] = ''
    else:
        sequences[name] = sequences[name] + line
df = pd.DataFrame.from_dict(sequences, orient='index', columns=['Sequence'])
df["length"] = df["Sequence"].apply(len)
print(df)

Replace test.fa with your multi-line fasta. For explanation of the code, refer to this blog: https://gamma2.wordpress.com/2014/01/03/reading-a-fasta-file-with-python/. Code is taken from this blog and modified at the end.

ADD REPLY
0
Entering edit mode
2.8 years ago
M__ ▴ 200

Install Biopython, then this will do it

from Bio import AlignIO

align = AlignIO.read(infile, 'fasta')
for record in align:
    print ('The sequence length is %s and the id is %s' % (len(record.seq), record.id)

You might want to use SeqIO instead, but its the same idea.

from Bio import SeqIO

records = list(SeqIO.parse('infile.fa", "fasta"))
for record in records:
     print ('The sequence length is %s and the id is %s' % (len(record.seq), record.id)
ADD COMMENT
0
Entering edit mode
2.8 years ago
Mensur Dlakic ★ 28k

Not sure why anyone would want to do this in base python, unless it is for homework. If so, I think you should be doing it on your own, or at least write a longer script before asking for help.

To do this in BioPython is absolutely trivial, and you can find many solutions by searching Biostars for fasta file biopython.

ADD COMMENT
0
Entering edit mode
2.0 years ago
M.O.L.S ▴ 100

It is about 20 lines of python code. Maybe someone else can do it in less.

dna = []
header = []

infile = open("fasta.fsa","r")

for line in infile:
    if line.startswith(">"):
        header.append(line.strip())
        dna.append('')
    else:
        dna[-1] += line.strip("\n")

# list of the headers.
print(header)   

 # list of the corresponding dna sequences (lengths)
 for length_of_sequences in dna:
      print(len(length_of_sequences), sep = " ")
 infile.close()
ADD COMMENT

Login before adding your answer.

Traffic: 1871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6