How do I go about extracting elements from a fasta file. For example, if I want a list of all the IDS and then length of a sequence in another list how do I do that in base python without using any libraries?
for line in fasta:
if line.startswith('>'):
#Extract ids
#Extract sequence lengths
I think you are making an unfounded assumption that everyone on this site wants things done the easiest way possible. That is simply not the case, as many posters are asking for solutions that will allow them to learn something from the scratch. That can be a homework, or can be a task posed by prospective employers. Granted, people who want to learn something from the scratch usually put more effort forward before coming here to ask the question, so I am leaning towards the homework assignment. Either way, the question is very clear about needing base python solution.
Its easy you open the file with an open statement with open(filenname, 'r') as f: the next line you pass the file into it with line = readlines .... You are then gonna section it up with some sort of while line statement ... in Python you need to reference every element in the resulting list... so you need a counter (its not like Perl). USe this to load a dictionary the > is stripped and used as the key whilst the sequence is the value. Once loaded you loop through the dictionary with for keys, values in dict.items(): and use the len(values) to get the sequence lengths. Its easy but Python's statement while lines is not like Perl, in Perl this statement triggers a magic function $_ for each line in the input file. Thats it.
Thank you this helps :) I am having trouble sectioning it up. What i did was an if else statement, if it has a >, then i get the IDS, then the else statement is to get the sequence. What do you mean exactly by counter?
Upvotes are always welcome and preferred. In the approach I'm using the counter is essential, because it doesn't count the file input lines for you (assuming you're using an input file - which is normal).
Yes could can do that because if you use a while line: and trigger a line by line counter i +=1 inside the while loop then if line[i][0] == '>': you're immediately in business because the next line is lstrip() (remove the '>') (rstrip() is good to removing '\n') and its in the dictionary or a list (you can use zip to join the id with the sequence into a dictionary). This is useful for massive files because its quicker than re. (regex) . Don't use an unprotected else statement ever in Python unless it leads to an Exception or a continue (if everything fails) or your really confident of all scenarios, elif is better, regex helps be certain of sequence capture - big data its avoid (its slow). Also you'll need to break the while loop and the counter is needed for that ... its weird not like Perl, so you'll need break to stop the while loop. The headache of this approach is len(line) against the counter which is set against a break. There are other approaches, but this is quick. Also keep in mind try: and except are cool.
I do agree this is much easier ... however, fasta never puts the id and sequence in the same line this is only really phylip format. ... In this scenario tyou can use re.split on a space character (but that uses re again - but split is a key part of any text processing. and the split provides key and value for a dictionary. So it is a million-fold easier than my code, but its not realistic against Genbank fasta.
I think I am not clear when I say IDs and sequences are single line. What I meant is this:
>seq 1
atgcgc
Not like this:
>seq 1
atgc
gc
In the first example sequence is in single line (flattened) and in the second example sequence spans multiple lines.
Given that this is what would be one of the pseudocode possible for parsing fasta:
Read file as iterator
Create two empty lists: ID and Length
Loop over iterator and append the lines that start with ">" to ID list and append the length of the lines that do not start with ">" to Length list (assuming that sequences are in single line)
Create a pandas data frame with ID and Length lists with column names as ID and Length.
No problem. That's how the above discussion spawned. Normally you'd use collections.Defaultdict to spool sequence spanning multiple lines onto the ID key using +=. In your case it doesn't appear permitted (no libraries) and I couldn't remember how to flatten a list. The flattening function is
def flatten(mylist):
return [item for sublist in mylist for item in sublist]
You need len(seq) and can't do that if the sequence is stuck in a list, which is what will otherwise happen if it spans multiple lines.
import pandas as pd
with open("test.fa", "r") as f:
file = f.readlines()
name = None
sequences = dict()
for line in file:
line = line.rstrip()
if line[0] == '>':
name = line[1:]
sequences[name] = ''
else:
sequences[name] = sequences[name] + line
df = pd.DataFrame.from_dict(sequences, orient='index', columns=['Sequence'])
df["length"] = df["Sequence"].apply(len)
print(df)
from Bio import AlignIO
align = AlignIO.read(infile, 'fasta')
for record in align:
print ('The sequence length is %s and the id is %s' % (len(record.seq), record.id)
You might want to use SeqIO instead, but its the same idea.
from Bio import SeqIO
records = list(SeqIO.parse('infile.fa", "fasta"))
for record in records:
print ('The sequence length is %s and the id is %s' % (len(record.seq), record.id)
Not sure why anyone would want to do this in base python, unless it is for homework. If so, I think you should be doing it on your own, or at least write a longer script before asking for help.
To do this in BioPython is absolutely trivial, and you can find many solutions by searching Biostars for fasta file biopython.
It is about 20 lines of python code. Maybe someone else can do it in less.
dna = []
header = []
infile = open("fasta.fsa","r")
for line in infile:
if line.startswith(">"):
header.append(line.strip())
dna.append('')
else:
dna[-1] += line.strip("\n")
# list of the headers.
print(header)
# list of the corresponding dna sequences (lengths)
for length_of_sequences in dna:
print(len(length_of_sequences), sep = " ")
infile.close()
For this task, tools like bioawk , seqkit are good.
awk
is cool ... I've never heard of seqkitIts not worth coding in Python without Biopython 'cause its reinventing the wheel but
len('ACGATCGA')
is the basis of you codeI think you are making an unfounded assumption that everyone on this site wants things done the easiest way possible. That is simply not the case, as many posters are asking for solutions that will allow them to learn something from the scratch. That can be a homework, or can be a task posed by prospective employers. Granted, people who want to learn something from the scratch usually put more effort forward before coming here to ask the question, so I am leaning towards the homework assignment. Either way, the question is very clear about needing base python solution.
i am just asking about how to approach this as i did put in effort to learn it. thanks for the help it was so helpful Mensur Dlakic
Its easy you open the file with an open statement
with open(filenname, 'r') as f:
the next line you pass the file into it withline = readlines ...
. You are then gonna section it up with some sort ofwhile line
statement ... in Python you need to reference every element in the resulting list... so you need a counter (its not like Perl). USe this to load a dictionary the>
is stripped and used as the key whilst the sequence is the value. Once loaded you loop through the dictionary withfor keys, values in dict.items():
and use the len(values) to get the sequence lengths. Its easy but Python's statementwhile lines
is not like Perl, in Perl this statement triggers a magic function$_
for each line in the input file. Thats it.Thank you this helps :) I am having trouble sectioning it up. What i did was an if else statement, if it has a >, then i get the IDS, then the else statement is to get the sequence. What do you mean exactly by counter?
Upvotes are always welcome and preferred. In the approach I'm using the counter is essential, because it doesn't count the file input lines for you (assuming you're using an input file - which is normal). Yes could can do that because if you use a
while line:
and trigger a line by line counteri +=1
inside thewhile
loop thenif line[i][0] == '>':
you're immediately in business because the next line islstrip()
(remove the '>') (rstrip() is good to removing '\n') and its in the dictionary or a list (you can usezip
to join the id with the sequence into a dictionary). This is useful for massive files because its quicker thanre.
(regex) . Don't use an unprotectedelse
statement ever in Python unless it leads to an Exception or acontinue
(if everything fails) or your really confident of all scenarios,elif
is better, regex helps be certain of sequence capture - big data its avoid (its slow). Also you'll need to break thewhile
loop and the counter is needed for that ... its weird not like Perl, so you'll needbreak
to stop thewhile
loop. The headache of this approach islen(line)
against the counter which is set against abreak
. There are other approaches, but this is quick. Also keep in mindtry:
andexcept
are cool.thank you, can you give a quick pseudocode breakdwon, im having trouble understanding it
assuming that id and sequences are in single line:
I do agree this is much easier ... however, fasta never puts the id and sequence in the same line this is only really phylip format. ... In this scenario tyou can use re.split on a space character (but that uses re again - but split is a key part of any text processing. and the split provides key and value for a dictionary. So it is a million-fold easier than my code, but its not realistic against Genbank fasta.
I think I am not clear when I say IDs and sequences are single line. What I meant is this:
Not like this:
In the first example sequence is in single line (flattened) and in the second example sequence spans multiple lines.
Given that this is what would be one of the pseudocode possible for parsing fasta:
okay, thanks my misunderstanding
Thank you, what if the sequence lines take up multiple lines?
I wrote out the complete code to do this below - but you ignored it so I removed it
Unfortunately, I wasn't able to see it???
No problem. That's how the above discussion spawned. Normally you'd use collections.Defaultdict to spool sequence spanning multiple lines onto the ID key using
+=
. In your case it doesn't appear permitted (no libraries) and I couldn't remember how to flatten a list. The flattening function isYou need len(seq) and can't do that if the sequence is stuck in a list, which is what will otherwise happen if it spans multiple lines.
Goodluck with your project
Replace
test.fa
with your multi-line fasta. For explanation of the code, refer to this blog: https://gamma2.wordpress.com/2014/01/03/reading-a-fasta-file-with-python/. Code is taken from this blog and modified at the end.