Hello everyone.
I have a vary basic question but I am kind of loose in it.
I wrote a python script to do some calculations on a fasta file with multiple sequences.
when I try to do it my computer hangs.
Is there is any way to read the sequences so it will work properly?
Reading specific number of characters each time won't help - because I need the whole sequence at once.
The goal is to read all cDNAs of all genomes.
For example my file looks like this: (This is just a few sequences)
>CCE57618 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:166:1143:1 gene:HUS2011_pII0001 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:repA description:replication initiation protein RepFIB
GTGGATAAGTCGTCCGGTGAGCTGGTGACACTGACACCAAACAATAACAACACCGTACAA
CCTGTGGCGCTGATGCGTCTGGGCGTCTTTGTACCGACCCTTAAGTCACTGAAGAACAGT
AAAAAAAATACACTGTCACGCACTGATGCCACGGAAGAACTGACGCGTCTTTCCCTGGCC
CGTGCAGAAGGATTCGATAAGGTTGAGATCACCGGTCCCCGGCTGGATATGGATAACGAT
TTCAAGACCTGGGTGGGGATCATTCATTCCTTTGCCCGCCATAACGTGACTGGTGACAAA
GTTGAACTGCCTTTTGTCGAGTTTGCAAAACTGTGTGGTATACCTTCAAGCCAGTCATCA
CGCAGGCTGCGTGAGCGCATCAGCCCTTCCCTGAAACGCATTGCCGGTACCGTGATCTCC
TTTTCCCGCACCGATGAGAAGCACACCCGGGAATATATCACCCATCTGGTACAGTCAGCC
TACTACGATACTGAACGGGATATTGTTCAGTTACAGGCTGATCCCCGCCTGTTTGAACTG
TACCAGTTTGACAGAAAGGTCCTTCTTCAGCTTAAGGCGATTAATGCCCTGAAGCGACGG
GAGTCCGCCCAGGCACTTTACACCTTTATAGAGAGCCTGCCCCGGGATCCGGCACCGATA
TCGCTGGCGCGGCTACGTGCCCGCCTCAATCTGAAGTCTCCGGTATTTTCCCAGAACCAG
ACGGTCAGACGGGCAATGGAGCAGTTACGTGAGATTGGATATCTTGATTACACGGAGATC
CAGCGGGGGCGAACAAAACTCTTCTGTATTCACTACCGACGTCCCCGGTTAAAAGCGCCG
AATGATGAGAGTAAGGAAAATCCGTTGCCACCTTCACCTGTGGAAAAAGTCAGTCCGGAG
ATGGCGGAGAAACTTGCCCTGCTTGAAAAACTGGGCATCACGCTGGATGACCTGGAAAAA
CTCTTCAAATCCCGCTGA
>CCE57619 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1219:1338:-1 gene:HUS2011_pII0002 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGTTTTATGAAGGGAGCAATGCCTCAGCATCAGGTTACGGGGTGACTCACGTAAGGGAC
AGGCAGATGGCAGCTCAGCCACAGGCAGCACTGCAGGAAACTGAATATAAACTGCAGTGA
>CCE57620 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1422:2162:-1 gene:HUS2011_pII0003 gene_biotype:protein_coding transcript_biotype:protein_coding description:site-specific recombinase
ATGAACAATGTCATTCCCCTGCAGAATTCACCAGAACGCGTCTCCCTGTTACCCATTGCG
CCGGGGGTGGATTTTGCAACAGCGCTCTCCCTGAGAAGAATGGCCACTTCCACGGGGGCC
ACACCGGCCTACCTGCTGGCCCCGGAAGTGAGTGCCCTTCTTTTCTATATGCCGGATCAG
CGTCACCATATGCTGTTCGCCACCCTCTGGAATACCGGAATGCGTATTGGCGAAGCCCGG
ATGCTGACACCGGAATCATTTGACCTGGATGGAGTAAGACCGTTTGTGCGGATCCAGTCC
GAAAAAGTGCGTGCGCGACGCGGACGCCCGCCAAAAGATGAAGTGCGCCTGGTTCCGCTG
ACAGATATAAGCTATGTCAGGCAGATGGAAAGCTGGATGATCACCACCCGGCCCCGTCGT
CGTGAACCATTATGGGCCGTGACCGACGAAACCATGCGCAACTGGCTGAAGCAGGCTGTC
AGACGGGCCGAAGCTGACGGGGTACACTTTTCGATTCCGGTAACACCACACACTTTCCGG
CACAGCTATATCATGCACATGCTCTATCACCGCCAGCCCCGGAAAGTCATCCAGGCACTG
GCTGGTCACAGGGATCCACGTTCGATGGAGGTCTATACACGGGTGTTTGCGCTGGATATG
GCTGCCACGCTGGCAGTGCCTTTCACAGGTGACGGACGGGATGCTGCAGAGATCCTGCGT
ACACTGCCTCCCCTGAAGTAA
And here what I wrote so far:
name = raw_input("file name: ") #Inputs the file's name
s = ""
ID = ""
with open(name,'r') as cDNAs:
for line in cDNAs:
if line.startswith('>'): #If it's a header:
if(ID != ""): #If there is a sequence.
print ID + "\n" + s + 2 * "\n"
cDNA(ID,s)
ID = line[1: line.find(' ')]
s = "" #Initializes the sequence.
else:
s += line #The sequence itself.
s = s.replace("\n", "") #Removes all the end of lines.
print ID + "\n" + s + 2 * "\n"
cDNA(ID,s)
Thanks!!!
How big the fasta file is and what kind of computation you need?