Question

How to remove poor quality sequences from a fasta file?

0

5.0 years ago

Kumar ▴ 120

I have a FASTA file that consists of thousands of viral genomes. I need to remove poor-quality genomes, meaning those in which more than 30% of the bases are N. Could you please help me do this?

perl python shell bash fasta • 2.0k views

ADD COMMENT • link 5.0 years ago by Kumar ▴ 120

0

I need to remove poor quality genomes which contain more than 30% NNNNNNN.

You can't assume that those genomes are poor quality. Perhaps that region is just not sequenced. N's are also used to pad/indicate areas that are not sequenced/sequenceable using current technologies.

ADD REPLY • link 5.0 years ago by GenoMax 152k
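Before removing anything, it may help to see how the N's are distributed in each genome, since one long unsequenced stretch is a different situation from N's scattered throughout. Below is a minimal Python sketch of that check, not something posted in the thread. The function names `read_fasta` and `n_stats` are hypothetical, and it assumes a plain-text FASTA where N or n is the only ambiguity code you care about.

```python
import re
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    An open file handle works here, and so does a list of strings.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Text before the first header is ignored.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_stats(seq: str) -> Tuple[float, int]:
    """Return (percent of bases that are N/n, length of the longest N run)."""
    if not seq:
        return 0.0, 0
    n_count = sum(1 for base in seq if base in "Nn")
    longest_run = max((len(m.group()) for m in re.finditer(r"[Nn]+", seq)), default=0)
    return 100.0 * n_count / len(seq), longest_run
```

To use it, open the file and loop over `read_fasta(handle)`, printing `n_stats(seq)` for each record. A high percentage with one very long run suggests a single unsequenced or padded region rather than poor base calling overall.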

0

I agree, @genomax, but these sequences are causing problems during alignment, which is why I would like to remove them. I have a large number of viral genomes, so I would rather keep the genomes with precisely called bases than the ones containing long runs of N.

ADD REPLY • link 5.0 years ago by Kumar ▴ 120

score 3 · Accepted Answer · 2020-07-29

3

5.0 years ago

microfuge ★ 2.0k

The UCSC Genome Browser utilities include a binary that does exactly that, available here: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faFilterN

faFilterN - Get rid of sequences with too many N's

usage: faFilterN in.fa out.fa maxPercentN

ADD COMMENT • link 5.0 years ago by microfuge ★ 2.0k
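If you cannot run the UCSC binary (for example, on a non-Linux machine), the same idea can be sketched in a few lines of Python. This is an illustration only, not the faFilterN implementation. The names `filter_records` and `filter_fasta` are hypothetical. The sketch keeps a record when its N percentage is at or below the threshold, which matches "remove genomes with more than 30% N"; faFilterN may treat the exact boundary differently.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def percent_n(seq: str) -> float:
    """Percentage of bases that are N or n (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return 100.0 * sum(1 for base in seq if base in "Nn") / len(seq)


def filter_records(lines: Iterable[str], max_percent_n: float) -> Iterator[Tuple[str, str]]:
    """Yield only the records whose N percentage is <= max_percent_n."""
    for header, seq in read_fasta(lines):
        if percent_n(seq) <= max_percent_n:
            yield header, seq


def filter_fasta(in_path: str, out_path: str, max_percent_n: float = 30.0, width: int = 60) -> int:
    """Write the records that pass the filter to out_path; return how many were kept."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in filter_records(src, max_percent_n):
            kept += 1
            dst.write(f">{header}\n")
            # Wrap the sequence at a fixed width, as most FASTA writers do.
            for i in range(0, len(seq), width):
                dst.write(seq[i:i + width] + "\n")
    return kept
```

A call such as `filter_fasta("in.fa", "out.fa", 30)` would mirror the argument order of the faFilterN usage line above; the file names are placeholders.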

0

Thank you, @microfuge. One clarification: on the command line, should maxPercentN be replaced by 30 or by 30%?

ADD REPLY • link 5.0 years ago by Kumar ▴ 120

1

From what I remember, just the number, without the percent sign.

ADD REPLY • link 5.0 years ago by microfuge ★ 2.0k

1

What do you think? Would % be a valid input for a numeric option?

ADD REPLY • link 5.0 years ago by GenoMax 152k

1

My rule of thumb: If doing something takes less than 10 seconds and can't hurt anyone, I just do it. Writing a question and waiting for a response is guaranteed to take longer than that.