Hello,
I downloaded genomic sequences from Ensembl, where they come either unmasked or hard-masked (repeats replaced by Ns). I could not find any soft-masked sequences on the Ensembl FTP server.
Flipping upper case to lower case at each masked position is quite straightforward (see below), but it takes a long time for a mammalian genome. Simple question: how can I do it faster, be it in Python or any other language?
#!/usr/bin/env python
from pyfasta import Fasta

masked_fasta = Fasta('test10k.rm.fa')  # hard-masked: repeats are Ns
unmask_fasta = Fasta('test10k.fa')     # unmasked

for seqid in unmask_fasta.keys():
    print(">" + seqid)
    unmasked_seq = unmask_fasta[seqid]
    masked_seq = masked_fasta[seqid]
    output_seq = ""
    for position in range(0, len(unmasked_seq)):
        # lowercase the base wherever the hard-masked copy has an N
        if masked_seq[position] == "N":
            base = unmasked_seq[position].lower()
        else:
            base = unmasked_seq[position]
        output_seq += base
    print(output_seq)
This really sped things up. I went from about 4 hours to 40 minutes. Thank you!
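For what it's worth, here is a minimal sketch of one way to avoid the per-base loop in the script above: find each run of Ns in the hard-masked sequence with a regular expression and lowercase the matching slice of the unmasked sequence in one go. The function name soft_mask is made up for illustration, and it works on plain Python strings, so the pyfasta records would first need to be pulled into strings (I am assuming one chromosome at a time fits in memory). I have not timed it on a full genome.

```python
import re


def soft_mask(unmasked, hard_masked):
    """Lowercase every base of `unmasked` that is an N in `hard_masked`.

    Hypothetical helper: both arguments are plain strings of equal length.
    """
    if len(unmasked) != len(hard_masked):
        raise ValueError("sequences differ in length")
    pieces = []
    last = 0
    # Each match is one contiguous masked region (a run of Ns).
    for match in re.finditer(r"N+", hard_masked):
        start, end = match.span()
        pieces.append(unmasked[last:start])           # unmasked stretch, kept as-is
        pieces.append(unmasked[start:end].lower())    # masked stretch, lowercased
        last = end
    pieces.append(unmasked[last:])                    # tail after the last masked run
    return "".join(pieces)


# Toy example: Ns at positions 2-4 and 9 of the hard-masked copy.
print(soft_mask("ACGTACGTAC", "ACNNNCGTAN"))  # prints ACgtaCGTAc
```

The idea is that the work scales with the number of masked regions rather than the number of bases, and the pieces are joined once at the end instead of growing a string one character at a time. As in the original script, any N that is already in the unmasked sequence (assembly gaps) comes out as a lowercase n.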