remove sequences with non-canonical nucleotides from fasta file
2
0
Entering edit mode
6.8 years ago

I want to print sequences form fasta file which do not have non-canonical nucleotides. Example fasta is:

>1
ATAcctcatctaGTGTG
ATGCTGCTAGTZ
>2
agagagagagagagag

My code is

from Bio import SeqIO
for record in SeqIO.parse("test.fasta", "fasta") :
    if set(record.seq) <= "ATCGatcg":
                print record

Instead of print the sequence of >2, it prints both.

What am I doing wrong? Thanks

SeqIO fasta • 2.8k views
ADD COMMENT
1
Entering edit mode
6.8 years ago

linearize and filter with awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '($2 ~ /^[ATGCatgc]+$/)' |\
tr "\t" "\n"

using bioalcidaejdk:

$ java -jar dist/bioalcidaejdk.jar -e 'stream().filter(F->java.util.regex.Pattern.matches("^[ATGCatgc]+$",F)).forEach(S->println(">"+S.getName()+"\n"+S));' input.fa
ADD COMMENT
0
Entering edit mode

Thank you Pierre, I accept your answer. But anyways, what is wrong with SeqIO?

ADD REPLY
1
Entering edit mode
6.8 years ago
Eric Lim ★ 2.2k
from Bio import SeqIO
from Bio.Alphabet.IUPAC import IUPACUnambiguousDNA
for record in SeqIO.parse("test.fasta", "fasta"):
    if set(record.seq.upper()) <= set(IUPACUnambiguousDNA.letters):
       print(record)
ADD COMMENT

Login before adding your answer.

Traffic: 2142 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6