Question

Split the fasta file based on sequence type

0

2.7 years ago

Ayish • 0

Hello,

I have a large FASTA file containing both nucleotide and protein sequences. I need to separate the sequences into two files based on the type of sequence. Is there any Python module that can do this?

Thanks in advance.

python Fasta biopython • 1.2k views

ADD COMMENT • link updated 2.7 years ago by barslmn ★ 2.4k • written 2.7 years ago by Ayish • 0

1

Do the sequence identifier lines tell you whether they're DNA or protein? Or are you going to have to guess for a peptide made out of glycine, alanine, cysteine and threonine?

ADD REPLY • link 2.7 years ago by barslmn ★ 2.4k

0

Unfortunately, no. It would be guesswork, I think.

ADD REPLY • link 2.7 years ago by Ayish • 0

1

Here is a snippet in Python, or you can try out the Biopython module too. But be aware that this guesswork can go very wrong if your sequences contain IUPAC nucleotide symbols other than ATCG: https://www.bioinformatics.org/sms/iupac.html

https://colab.research.google.com/drive/1XSQBDoLIyQUGwUJvXRtZHkcXxsVU6oRH?usp=sharing

from collections import defaultdict

# Write a small example file: two DNA records and two protein records.
with open('example.fasta', 'w') as f:
    f.write('>seq1DNA\nATCG\n>seq2DNA\nTCGT\nTCTC\n>seq1Protein\nSSTCG\n>seq2Protein\nHYRN\nKQES')

def fastaparser(fasta):
    '''
    Read fasta file and return a dict, each record with seq name as key
    '''
    records = defaultdict(list)
    key = None
    with open(fasta, 'r') as f:
        # splitlines() keeps the last line even without a trailing newline
        for line in f.read().splitlines():
            if line.startswith('>'):
                key = line
            elif key is not None and line:
                records[key].append(line)
    return records

records = fastaparser('example.fasta')
DNA_alphabet = {'A', 'T', 'C', 'G'}
# Open both outputs once in write mode so that reruns do not append duplicates.
with open('protein.fa', 'w') as protein, open('DNA.fa', 'w') as dna:
    for k, seqs in records.items():
        seq = ''.join(seqs)
        # Any character outside ATCG marks the record as protein.
        out = protein if set(seq.upper()) - DNA_alphabet else dna
        out.write(f"{k}\n{seq}\n")
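
To soften the IUPAC caveat mentioned above, one option is to accept the ambiguity codes but require that most letters are plain bases. The following is only a rough sketch: the function name `guess_type`, the 0.9 threshold and the two letter sets are my own assumptions, not part of the snippet above, and a short peptide made only of G, A, C and T will still be called nucleotide.

```python
# Hypothetical helper for illustration; names and threshold are assumptions.
# Letters that dominate real nucleotide records: four bases, U for RNA, N for unknown.
CORE_NUCLEOTIDES = set("ACGTUN")
# IUPAC nucleotide alphabet including the ambiguity codes (R, Y, S, W, K, M, B, D, H, V).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")

def guess_type(seq, min_core_fraction=0.9):
    """Guess 'nucleotide' or 'protein' for one sequence. Heuristic, can be wrong."""
    # Ignore gaps, stop symbols and whitespace; compare case-insensitively.
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        return "unknown"
    # A letter that is not a nucleotide code at all (e.g. Q, E, L) settles it.
    if not set(letters) <= IUPAC_NUCLEOTIDES:
        return "protein"
    # All letters are valid IUPAC codes: require that most of them are plain bases.
    core = sum(c in CORE_NUCLEOTIDES for c in letters)
    return "nucleotide" if core / len(letters) >= min_core_fraction else "protein"
```

With this rule, `SSTCG` and `HYRN` from the example file are still classified as protein even though every letter in them is also a valid IUPAC nucleotide code, because too few of their letters are plain bases.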

ADD REPLY • link 2.7 years ago by barslmn ★ 2.4k

score 1 · Answer 1 · 2022-12-16

1

2.7 years ago

Matthias Zepper 5.1k

Does the solution need to be Python? If not, you could use seqkit grep or seqkit fish to search for non-nucleotide letters in the sequences.
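
If it does have to be Python, the same idea (search each sequence for a character that cannot be a nucleotide) can be sketched with a regular expression. This is a minimal sketch under assumptions: `read_fasta`, `split_fasta` and the character class `[^ACGTUN]` are hypothetical choices for illustration, and records are read one at a time so that a large file does not have to fit in memory.

```python
import re

# Any character outside A, C, G, T, U, N marks the record as protein.
# This is an assumption: widen the class if your nucleotide records
# contain IUPAC ambiguity codes or gap characters.
NON_NUCLEOTIDE = re.compile(r"[^ACGTUN]", re.IGNORECASE)

def read_fasta(path):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_fasta(path, nucleotide_out, protein_out):
    """Write each record to one of two files; return (n_nucleotide, n_protein)."""
    n_nucleotide = n_protein = 0
    with open(nucleotide_out, "w") as nuc, open(protein_out, "w") as prot:
        for header, seq in read_fasta(path):
            if NON_NUCLEOTIDE.search(seq):
                prot.write(f"{header}\n{seq}\n")
                n_protein += 1
            else:
                nuc.write(f"{header}\n{seq}\n")
                n_nucleotide += 1
    return n_nucleotide, n_protein
```

Usage would be something like `split_fasta('example.fasta', 'DNA.fa', 'protein.fa')`. The stricter the character class, the more nucleotide records with ambiguity codes end up in the protein file, so it is worth checking a few records by hand.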
