Question

Python FASTA scripting

0

Entering edit mode

5.7 years ago

damonlbp ▴ 20

Hi all, I finally got that illusive Bioinformatician job and am now undergoing training. My first project is to write a FASTA parsing script that will take a FASTA file and split it into multiple files each containing a gene and its sequence. The task is to be done in pure Python 3.7 so no Biopython for me.

TestFile = "/Users/me/Desktop/PythonScrip/testanimals.fasta"
f1 = open(TestFile, "r")
titlelist = []
seqlist=[]
line = f1.readline()
if not line.startswith('>'):
    raise TypeError("Not a Fasta, Check file and start again.")
for line in f1:
    if line.startswith(">"):
        titlelist.append(line)
    if not line.startswith(">"):
        seqlist.append(line)
    if line.startswith('\n'):
        seqlist.remove('\n')
else:
    print("Carrying on...")
    print(titlelist)
    print(seqlist)


    for name in titlelist:
        names=name.strip(">").strip()
    for seq in seqlist:
        seqs=seq.strip(">").strip()
        file = open("/Users/me/Desktop/PythonScrip/gene{}.fasta".format(names),'w')
        file.write(StringToWrite)
        StringToWrite = "{}\n{}".format(names, seqs)
    print(StringToWrite)
    print(names)
    print(seqs)

My input file is very basic:

>duck
quackquackquack

>dinosaur
roarroarroarroar

>dog
woofwoofwoof

>fish
glubglubglub

>frog
ribbetribbetribbet

I am currently having a bit of an issue at this point. The script works great, multiple files are created with different names but they all contain the same sequence, eg all of the animals go ribbetribbetribbet. although funny its also annoying. I can't quite get my head around what I've done though I think it has something to do with the 1st for being a priority and essentially ignoring the second.

I know it's ugly code but I am just getting started :)

Thanks for the help!

Python FASTA • 10.0k views

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 5.7 years ago by damonlbp ▴ 20

4

Entering edit mode

The task is to be done in pure Python 3.7 so no Biopython for me.

That's a shitty requirement.

ADD REPLY • link 5.7 years ago by WouterDeCoster 48k

0

Entering edit mode

Counterpoint: Biopython is slow, and learning to parse files with Python is an invaluable skill. It's a decent exercise for a beginning bioinformatician or someone learning Python.

ADD REPLY • link 5.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Writing your own parser for everything is interesting but often unnecessary. Depends on what you want to get out of it I guess. Also fastq has no clear definition, lots of corner cases.

ADD REPLY • link 5.6 years ago by WouterDeCoster 48k

0

Entering edit mode

This will fail with fasta files that have linebreaks in seqs, no?

ADD REPLY • link 5.7 years ago by 5heikki 11k

0

Entering edit mode

Yeah looks that way to me. The seqlist collection only works when each sequence is a single line, and there is a 1 to 1 correspondence between the titlelist and the seqlist, else it will start concatenating sequences from mutliple entries together.

@damonlbp, You need to consider a more complicated test case, where each sequence is on multiple lines (which is by far the most common fasta format). In this case, you need to 'chunk' the file between the > characters.

Here's one option to 'chunk' the file (though it does require having the whole thing in memory which isn't ideal):

import sys, re

regex = re.compile("(>.*?)\n([ATGCNatgcn\n]*)", re.DOTALL)

with open(sys.argv[1], 'r') as fh:
    content = fh.read()
    seqdict = {}
    for i in re.findall(regex, content):
        seqdict[i[0]] = i[1].replace('\n', '')

print(seqdict)

Which for this file:

>1
ATACTAGCTA
CGATCGGGCT
ATATGCGCGC
T
>2
ATCGCGATTG
AGCGCGCTAG
TGATGCCGCG
CAATGCTAGG
>3
ATCGCGATTG
AGCGCGCTAG
TGATGCCGCG
CAATGCTAGG
>4
TTCGCGATTG
AGCGCGCTAT
TGATGCCGCC
CAATGCTA

Results in:

{'>1': 'ATACTAGCTACGATCGGGCTATATGCGCGCT', '>2': 'ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG', '>3': 'ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG', '>4': 'TTCGCGATTGAGCGCGCTATTGATGCCGCCCAATGCTA'}

ADD REPLY • link 5.7 years ago by Joe 22k

0

Entering edit mode

it will be much easier to 1) read all content as a single string; 2) split by '>'; 3) split by '\n'. And most of the similar work is actually written quite nicely in the book: http://shop.oreilly.com/product/9780596154516.do

ADD REPLY • link 5.7 years ago by shoujun.gu ▴ 350

0

Entering edit mode

No, you can’t split by ‘\n’, because fasta sequences are line wrapped. You cannot rely on line breaks to delineate sequence from headers.

Plus, 'easier' is subjective. I would challenge you to write something that iterates a whole file and successfully 'chunks' the genes in fewer lines than the regex loop above.

ADD REPLY • link 5.7 years ago by Joe 22k

0

Entering edit mode

this will print the same result as your code when apply on your sample file without using re.

txt=open('sample.txt', 'r').read()
genes=txt.split('>')[1:]

seqdic={}
for gene in genes:
    gene=gene.strip().split('\n')
    seqdic['>'+gene[0]]=''.join(gene[1:])

print(seqdic)

ADD REPLY • link 5.7 years ago by shoujun.gu ▴ 350

score 2 · Answer 1 · 2019-10-31

One approach I'd suggest uses standard input. This makes it very easy to test out different inputs:

#!/usr/bin/env python

import sys
import re
import os
import errno

sequence = ""
for line in sys.stdin:
  line = line.rstrip()
  if line.startswith('>'):
    if len(sequence) > 0:
      clean_header = re.sub('[^A-Za-z0-9]+', '', header)
      out_fn = '{}.fa'.format(clean_header)
      if os.path.exists(out_fn):
        raise OSError(errno.EEXIST)
      with open(out_fn, 'w') as o:
        o.write('>{}\n{}\n'.format(header, sequence))
    header = line.lstrip('>')
    sequence = ""
  else:
    sequence += line

# print out last element
if len(sequence) > 0:
  clean_header = re.sub('[^A-Za-z0-9]+', '', header)
  out_fn = '{}.fa'.format(clean_header)
  if os.path.exists(out_fn):
    raise OSError(errno.EEXIST)
  with open(out_fn, 'w') as o:
    o.write('>{}\n{}\n'.format(header, sequence))

To use:

$ python splitter.py < input.fa

This does a few things that I think make for a decent Python script:

It should work with Python 2 or 3.
It handles multiline Fasta input.
It does not require slow dependencies like Biopython.
If does some rudimentary filename sanitation, in case the Fasta record's header has characters in it that would cause problems with a filesystem (slashes, some punctuation, etc.)
It does some rudimentary error checking, so that you don't overwrite files with the same file name.
It uses Unix standard input and error streams.
It uses the with motif to automatically close file handles, once the work is done.

score 0 · Answer 2 · 2019-09-27

Hello,

for name in titlelist:
        names=name.strip(">").strip()

here you are overwrite names with each iteration. The result is, that names contains the last name in the list.

for seq in seqlist:
        seqs=seq.strip(">").strip()
        file = open("/Users/me/Desktop/PythonScrip/gene{}.fasta".format(names),'w')
        file.write(StringToWrite)
        StringToWrite = "{}\n{}".format(names, seqs)

Now names is taken as a file name in each iteration. Because names doesn't change anymore, you are always overwrite the same file. The fasta header stays the same, only the sequence is changing with each iteration.

The script works great, multiple files are created with different names

This cannot be true (see above). Furthermore you should get a NameError: name 'StringToWrite' is not defined, because you are using the variable before assign it.

Some more tips:

You don't take care of the case, where the sequence is longer then one line.
Get familiar with the with statement for opening/closing files
Also the concept of dictionaries will be useful for this usecase

score 0 · Answer 3 · 2019-09-27

You can do the following:

import sys

def fasta_parser(fasta):
    """
    Iterates through fasta
    Args:
        fasta (file): fasta_file
    Returns:
        seqs (list): list of seqs 
        headers (list): list of headers
    """
    seqs = []
    headers = []
    with open(fasta) as fa:
        sequence = ""
        header = None
        for line in fa:
            if line.startswith('>'):
                headers.append(line[1:-1])
                if header:
                    seqs.append([sequence])
                sequence = ""
                header = line[1:]
            else:
                sequence += line.rstrip()
        seqs.append([sequence])
    return headers, seqs

def main():
    myfa = sys.argv[1]
    headers, seqs = fasta_parser(myfa)
    flat_seqs = []
    for seq in seqs:
        for item in seq:
            flat_seqs.append(item)
    for header, seq in zip(headers, flat_seqs):
        with open('%s.fa' %header, 'w') as new_fa:
            new_fa.write('>'+header+'\n'+str(seq))

if __name__ == "__main__":
    main()

This is dealing with all the problems listed in other answers such as multiline fastas and multi-fastas. It stores the headers and seqs in a list when you unpack the function. It is then iterating through both the headers and seqs lists (which are the same length because I have flattened the multiline sequences) and naming the file based on the header and adding a .fa extension to it, then writing the header to the first line, and the seq as a flat line under the file.

EDIT: To run the code just python3 split_fa.py /path/to/fun.fa

GenoMax · Answer 4 · 2019-10-31

0

Entering edit mode

5.6 years ago

damonlbp ▴ 20

Hi All, incase anyone wanted to know this is the function i went with in the end. With some help from people on here and on my course we managed:

def FastaSplitter(FileInput, FileOutPut):   
    Name = ""
    Seq = []

FILE = open(FileInput, 'r')
for line in FILE:
    if line[0] == '>':
        if Name == "":
            pass
        else:
            FILE.close()
        print(line)
    else:
        print(line)

It looks like I may have originally overcomplicating things. Thanks for the all the help everyone.

ADD COMMENT • link updated 5.6 years ago by GenoMax 151k • written 5.6 years ago by damonlbp ▴ 20

0

Entering edit mode

Thanks for reporting back, glad you got a solution working.

As a general comment I would use a with open() statement to handle the file opening and closing so that you don't run the risk of leaving files dangling open.

ADD REPLY • link 5.6 years ago by Joe 22k

0

Entering edit mode

Hi! I am trying to do this. how can I make it work? I don't understand how to set this: (FileInput, FileOutPut). My level of python is low. Thanks!

ADD REPLY • link 3.8 years ago by MCA • 0

0

Entering edit mode

Those are variables which stand in for filepaths. You need to provide the files to the function that you want to do the operations on.

If you are a real beginner, I suggest finding some courses or basic tutorials. Teaching Python from scratch is beyond the scope of the forum.

ADD REPLY • link 3.8 years ago by Joe 22k

0

Entering edit mode

Thank you. I understand what it is necessary the inputfile, bue I don't understand in which part of the function is mentioning the output file so I don't understand why it is necessary to call it in the function. I have been able to run and construct several python scripts so I thought that I was just a little lost with this one. I will continue my training of course. Thanks!