Hi! I'm an master thesis student that tries to learn this:) I've searched this forum but could really find anything that answers my question.
I want to extract specific chromosome sequences from the human genome to a new fasta file and I'm using this python script:
sequences = SeqIO.parse(GRCh38.p13.genome.fa, 'fasta')
with open(new_fasta.fa', 'w+') as fa:
for line in sequences:
if line.id == chr:
fa.write(">" + strline.id) + "\n")
fa.write(str(line.seq)+ "\n\n")
break # I use break here, otherwise it will continue with chr10,12,13 etc. if i choose chr1
It works, but chr1 will take like 1 sec to extract and chr20 will take 20. How can I speed this up?
truly greatful for help!
well, I'm not a python guy but use SeqIO seems to loads each whole sequence in memory while you just need to scan line per line an check for the sequence header.
Furthermore , for this task, you should have a look at
samtools faidx
There may be contributors on this site that already have a faster solution and will share it. On the surface, however, I question the need to expend the effort to lower the running time from 20 seconds to 1 second. It seems easier to wait additional 19 seconds than to potentially spend more time than that optimizing the code.