Question

Best way to processing huge fasta files using python.

0

Entering edit mode

8.1 years ago

elisheva ▴ 120

Hello everyone.

I have a vary basic question but I am kind of loose in it.
I wrote a python script to do some calculations on a fasta file with multiple sequences.
when I try to do it my computer hangs.
Is there is any way to read the sequences so it will work properly?
Reading specific number of characters each time won't help - because I need the whole sequence at once.
The goal is to read all cDNAs of all genomes.

For example my file looks like this: (This is just a few sequences)

>CCE57618 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:166:1143:1 gene:HUS2011_pII0001 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:repA description:replication initiation protein RepFIB
GTGGATAAGTCGTCCGGTGAGCTGGTGACACTGACACCAAACAATAACAACACCGTACAA
CCTGTGGCGCTGATGCGTCTGGGCGTCTTTGTACCGACCCTTAAGTCACTGAAGAACAGT
AAAAAAAATACACTGTCACGCACTGATGCCACGGAAGAACTGACGCGTCTTTCCCTGGCC
CGTGCAGAAGGATTCGATAAGGTTGAGATCACCGGTCCCCGGCTGGATATGGATAACGAT
TTCAAGACCTGGGTGGGGATCATTCATTCCTTTGCCCGCCATAACGTGACTGGTGACAAA
GTTGAACTGCCTTTTGTCGAGTTTGCAAAACTGTGTGGTATACCTTCAAGCCAGTCATCA
CGCAGGCTGCGTGAGCGCATCAGCCCTTCCCTGAAACGCATTGCCGGTACCGTGATCTCC
TTTTCCCGCACCGATGAGAAGCACACCCGGGAATATATCACCCATCTGGTACAGTCAGCC
TACTACGATACTGAACGGGATATTGTTCAGTTACAGGCTGATCCCCGCCTGTTTGAACTG
TACCAGTTTGACAGAAAGGTCCTTCTTCAGCTTAAGGCGATTAATGCCCTGAAGCGACGG
GAGTCCGCCCAGGCACTTTACACCTTTATAGAGAGCCTGCCCCGGGATCCGGCACCGATA
TCGCTGGCGCGGCTACGTGCCCGCCTCAATCTGAAGTCTCCGGTATTTTCCCAGAACCAG
ACGGTCAGACGGGCAATGGAGCAGTTACGTGAGATTGGATATCTTGATTACACGGAGATC
CAGCGGGGGCGAACAAAACTCTTCTGTATTCACTACCGACGTCCCCGGTTAAAAGCGCCG
AATGATGAGAGTAAGGAAAATCCGTTGCCACCTTCACCTGTGGAAAAAGTCAGTCCGGAG
ATGGCGGAGAAACTTGCCCTGCTTGAAAAACTGGGCATCACGCTGGATGACCTGGAAAAA
CTCTTCAAATCCCGCTGA
>CCE57619 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1219:1338:-1 gene:HUS2011_pII0002 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGTTTTATGAAGGGAGCAATGCCTCAGCATCAGGTTACGGGGTGACTCACGTAAGGGAC
AGGCAGATGGCAGCTCAGCCACAGGCAGCACTGCAGGAAACTGAATATAAACTGCAGTGA
>CCE57620 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1422:2162:-1 gene:HUS2011_pII0003 gene_biotype:protein_coding transcript_biotype:protein_coding description:site-specific recombinase
ATGAACAATGTCATTCCCCTGCAGAATTCACCAGAACGCGTCTCCCTGTTACCCATTGCG
CCGGGGGTGGATTTTGCAACAGCGCTCTCCCTGAGAAGAATGGCCACTTCCACGGGGGCC
ACACCGGCCTACCTGCTGGCCCCGGAAGTGAGTGCCCTTCTTTTCTATATGCCGGATCAG
CGTCACCATATGCTGTTCGCCACCCTCTGGAATACCGGAATGCGTATTGGCGAAGCCCGG
ATGCTGACACCGGAATCATTTGACCTGGATGGAGTAAGACCGTTTGTGCGGATCCAGTCC
GAAAAAGTGCGTGCGCGACGCGGACGCCCGCCAAAAGATGAAGTGCGCCTGGTTCCGCTG
ACAGATATAAGCTATGTCAGGCAGATGGAAAGCTGGATGATCACCACCCGGCCCCGTCGT
CGTGAACCATTATGGGCCGTGACCGACGAAACCATGCGCAACTGGCTGAAGCAGGCTGTC
AGACGGGCCGAAGCTGACGGGGTACACTTTTCGATTCCGGTAACACCACACACTTTCCGG
CACAGCTATATCATGCACATGCTCTATCACCGCCAGCCCCGGAAAGTCATCCAGGCACTG
GCTGGTCACAGGGATCCACGTTCGATGGAGGTCTATACACGGGTGTTTGCGCTGGATATG
GCTGCCACGCTGGCAGTGCCTTTCACAGGTGACGGACGGGATGCTGCAGAGATCCTGCGT
ACACTGCCTCCCCTGAAGTAA

And here what I wrote so far:

name = raw_input("file name: ") #Inputs the file's name
s = ""
ID = ""
with open(name,'r') as cDNAs:
    for line in cDNAs:
        if line.startswith('>'): #If it's a header:
            if(ID != ""): #If there is a sequence.
            print ID + "\n" + s + 2 * "\n"
                cDNA(ID,s)
            ID = line[1: line.find(' ')]
            s = "" #Initializes the sequence.
        else:
            s += line #The sequence itself.
            s = s.replace("\n", "") #Removes all the end of lines.

print ID + "\n" + s + 2 * "\n"
cDNA(ID,s)

Thanks!!!

python fasta • 14k views

ADD COMMENT • link updated 8.1 years ago by unksci ▴ 180 • written 8.1 years ago by elisheva ▴ 120

1

Entering edit mode

How big the fasta file is and what kind of computation you need?

ADD REPLY • link 8.1 years ago by shenwei356 8.7k

score 3 · Answer 1 · 2017-04-29

3

Entering edit mode

8.1 years ago

unksci ▴ 180

When processing large files with many FASTA sequences one potential reason for hanging the computer is a lack of sufficient memory.

While such a scenario can be avoided in multiple ways, a particularly convenient option is to use BioPython and its parser, which allows you to process individual entries sequentially

# Install the biopython library (if not already installed)
!pip install biopython

# Import parts of Biopython
from Bio import SeqIO


# File path to your FASTA file
path_to_file = 'example.fasta'   # <--- substitute by your local path

# Open file with "with" statement to avoid problems with access 
# to original file (in case computer hangs
# or there will be any other problem)
with open(path_to_file, mode='r') as handle:

    # Use Biopython's parse function to process individual
    # FASTA records (thus reducing memory footprint)
    for record in SeqIO.parse(handle, 'fasta'):

        # Extract individual parts of the FASTA record
        identifier = record.id
        description = record.description
        sequence = record.seq

        # Example: adapt to extract features you are interested in
        print('----------------------------------------------------------')
        print('Processing the record {}:'.format(identifier))
        print('Its description is: \n{}'.format(description))
        amount_of_nucleotides = len(sequence)
        print('Its sequence contains {} nucleotides.'.format(amount_of_nucleotides))

ADD COMMENT • link 8.1 years ago by unksci ▴ 180

1

Entering edit mode

Whereas you are right about processing one entry at a time to avoid excessive memory usage when reading very large FASTA files, my experience is that BioPython is quite slow at parsing FASTA files. I would thus not use it if speed is a major concern.

ADD REPLY • link 8.1 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

And if the speed indeed important to me? At least I want the Algorithm to run on all genomes. Is there a better way for this? (even not on python...)

ADD REPLY • link 8.1 years ago by elisheva ▴ 120

0

Entering edit mode

The right way to go depends a lot on what you want to do. Reading the FASTA file is clearly not the end goal here - I assume you want to do something with the sequences. My suggestion would be that you post a new question, in which you explain what the actual problem you are trying to solve is. That would make it so much easier for us to give you a good answer.

ADD REPLY • link 8.1 years ago by Lars Juhl Jensen 11k

1

Entering edit mode

O.K. I will do it. tnx.

ADD REPLY • link 8.1 years ago by elisheva ▴ 120

1

Entering edit mode

You dont need with open() before SeqIO parse I believe. SeqIO should handle the opening of the file if its passed a filepath as a variable. Certainly this is true of SeqIO.read and SeqIO.convert so I believe will be the case for parse too.

ADD REPLY • link 8.1 years ago by Joe 22k

0

Entering edit mode

You are right indeed.

ADD REPLY • link 8.1 years ago by WouterDeCoster 48k

0

Entering edit mode

While "with" is not strictly required for SeqIO to parse, it's usage is motivated by python's error handling and the unlocking /closing of the fasta file in the preanticipation of a possible and vaguely defined "hanging of the computer" (OP) (as that hanging could also be because of OP's code that processes individual records)

ADD REPLY • link 8.1 years ago by unksci ▴ 180