Question

remove sequences with non-canonical nucleotides from fasta file

0

7.4 years ago

grant.hovhannisyan ★ 2.6k

I want to print the sequences from a fasta file that do not contain non-canonical nucleotides. An example fasta is:

>1
ATAcctcatctaGTGTG
ATGCTGCTAGTZ
>2
agagagagagagagag

My code is

from Bio import SeqIO
for record in SeqIO.parse("test.fasta", "fasta") :
    if set(record.seq) <= "ATCGatcg":
                print record

Instead of printing only sequence >2, it prints both.

What am I doing wrong? Thanks

SeqIO fasta • 3.1k views

1

7.4 years ago

Eric Lim ★ 2.2k

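from Bio import SeqIO
# Note: Bio.Alphabet was removed in Biopython 1.78, so this import needs an older release.
from Bio.Alphabet.IUPAC import IUPACUnambiguousDNA

for record in SeqIO.parse("test.fasta", "fasta"):
    # Compare two sets: uppercase the sequence, then test it against the letters "GATC".
    if set(record.seq.upper()) <= set(IUPACUnambiguousDNA.letters):
        print(record)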
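
The subset operator `<=` is meant to compare two sets, so both the sequence and the allowed letters are turned into sets before the test. A minimal sketch of the same check that does not depend on `Bio.Alphabet` might look like the following. The names `CANONICAL`, `is_canonical` and `print_canonical_records` are made up for illustration, and "canonical" is assumed to mean A, C, G and T in either case.

```python
CANONICAL = set("ACGT")  # assumed definition of "canonical" DNA letters


def is_canonical(seq):
    """Return True if every character of seq is A, C, G or T (case-insensitive)."""
    # Both sides of <= are sets, so this is a genuine subset test.
    return set(str(seq).upper()) <= CANONICAL


def print_canonical_records(path):
    """Print the records of a FASTA file that contain only canonical letters."""
    from Bio import SeqIO  # imported here so is_canonical works without Biopython

    for record in SeqIO.parse(path, "fasta"):
        if is_canonical(record.seq):
            # Write the record back out in FASTA format.
            print(record.format("fasta"), end="")
```

With the example file from the question, `print_canonical_records("test.fasta")` should keep record `2` and skip record `1`, since the latter contains a `Z`.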

score 1 · Accepted Answer · 2018-02-09

1

7.4 years ago

Pierre Lindenbaum 166k

Linearize the fasta (one record per line, header and sequence separated by a tab), then filter with awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '($2 ~ /^[ATGCatgc]+$/)' |\
tr "\t" "\n"
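
For readers who prefer Python, the same linearize-then-filter idea can be sketched with the standard library only. This is an illustrative equivalent, not part of the original answer: the function name `canonical_records` is hypothetical, and the regular expression mirrors the `[ATGCatgc]+` pattern used in the awk command.

```python
import io
import re

# Same character class as the awk filter: one or more canonical letters.
CANONICAL = re.compile(r"[ACGTacgt]+")


def canonical_records(lines):
    """Yield (header, sequence) for records made only of A, C, G and T."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                seq = "".join(chunks)
                if CANONICAL.fullmatch(seq):
                    yield header, seq
            header, chunks = line, []
        else:
            chunks.append(line)  # join multi-line sequences
    # Emit the last record in the input.
    if header is not None:
        seq = "".join(chunks)
        if CANONICAL.fullmatch(seq):
            yield header, seq


# Usage with the example from the question (an in-memory file stands in for input.fa).
example = io.StringIO(">1\nATAcctcatctaGTGTG\nATGCTGCTAGTZ\n>2\nagagagagagagagag\n")
for header, seq in canonical_records(example):
    print(header)
    print(seq)
```

Passing an open file handle instead of the `io.StringIO` object works the same way, because the function only iterates over lines.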

Using bioalcidaejdk:

$ java -jar dist/bioalcidaejdk.jar -e 'stream().filter(F->java.util.regex.Pattern.matches("^[ATGCatgc]+$",F)).forEach(S->println(">"+S.getName()+"\n"+S));' input.fa