I have a FASTA file that consists of thousands of viral genomes. I need to remove the poor-quality genomes that contain more than 30% N bases. Could you please help me do this?
The UCSC Genome Browser utilities include a binary that does exactly this, available here: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faFilterN
faFilterN - Get rid of sequences with too many N's
usage: faFilterN in.fa out.fa maxPercentN
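If you cannot run the Linux binary, the same idea can be sketched in plain Python. This is only a minimal illustration, not how faFilterN itself is implemented: the function names (`read_fasta`, `percent_n`, `filter_fasta`) are made up for this example, a genome is dropped only when its N fraction is strictly greater than the threshold (matching "more than 30%" in the question), and records with an empty sequence are assumed to be unwanted and are dropped.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def percent_n(seq):
    """Return the percentage of N/n characters in seq."""
    if not seq:
        # Assumption: an empty record is treated as 100% N so it gets dropped.
        return 100.0
    return 100.0 * seq.upper().count("N") / len(seq)


def filter_fasta(in_path, out_path, max_percent_n=30.0, width=60):
    """Copy records with at most max_percent_n percent N to out_path.

    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for header, seq in read_fasta(fin):
            if percent_n(seq) > max_percent_n:
                dropped += 1
                continue
            fout.write(f">{header}\n")
            # Re-wrap the sequence at a fixed line width.
            for i in range(0, len(seq), width):
                fout.write(seq[i:i + width] + "\n")
            kept += 1
    return kept, dropped
```

Calling `filter_fasta("in.fa", "out.fa", 30.0)` would write the retained genomes to `out.fa` and return how many records were kept and dropped. Note that only N is counted here; other IUPAC ambiguity codes (R, Y, and so on) are left alone.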
You can't assume that those genomes are poor quality. Perhaps that region is just not sequenced. N's are also used to pad/indicate areas that are not sequenced/sequenceable using current technologies.
I agree, @genomax, but these sequences cause problems during alignment, which is why I would like to remove them. I have a good number of viral genomes, so I would rather keep the genomes with precise base calls than the ones containing runs of N.