Question

Removing sequences that contain Ns from a FASTA file

0

8.8 years ago

kws15 ▴ 40

Hi everyone,

I have a giant FASTA file, but some of the sequences have Ns in them:

GeneID:107003026

AAATTTACTTGTCCTTGTGAT

GeneID:107005138

TATGCACNNNGGTTGC

GeneID:107004481

GATTTTATGTTGCTGAA

The second sequence contains Ns. How can I remove the whole record (header and sequence) so that the output looks like this? Thank you very much.

GeneID:107003026

AAATTTACTTGTCCTTGTGAT

GeneID:107004481

GATTTTATGTTGCTGAA

fasta • 7.7k views

ADD COMMENT • link updated 8.8 years ago by Brian Bushnell 20k • written 8.8 years ago by kws15 ▴ 40

2

Did you search for similar posts on biostars.org? What did you find? What have you tried?

ADD REPLY • link 8.8 years ago by Pierre Lindenbaum 164k

1

There are many ways to do this correctly, but are you sure you want to? What is your rationale?

ADD REPLY • link 8.8 years ago by Brian Bushnell 20k

0

I am analysing promoter sequences from two closely related species. I need to align them against each other to see how similar they are, so I will probably use BLAST, but it gives me an error when I try that in R on sequences that contain Ns. So I guess I will just have to leave out the sequences that have Ns.

ADD REPLY • link 8.8 years ago by kws15 ▴ 40

0

That doesn't look like a FASTA file (no ">").

ADD REPLY • link 8.8 years ago by igor 13k

0

Yeah, there are '>'s in my file; they just disappeared when I posted the sequences here for some reason.

ADD REPLY • link 8.8 years ago by kws15 ▴ 40

0

If you want to use only the default unix tools, you can use grep to filter out Ns (assuming your sequence names do not have Ns):

grep -v "N" in.fa > no_N.fa

Then filter out empty records (headers whose sequence line was removed by grep):

awk '$2{print RS}$2' FS='\n' RS='>' ORS= no_N.fa

ADD REPLY • link 8.8 years ago by igor 13k

3

I do not recommend this, as it is unsafe. A good solution should handle all possible Fasta variants, whether they are multi-line, contain 'N' in headers, etc.

ADD REPLY • link 8.8 years ago by Brian Bushnell 20k
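
A minimal sketch of what such a safer filter could look like using only the Python standard library. It joins multi-line records before testing them and never inspects the header, so an 'N' in a sequence name does not matter. The function names `read_fasta` and `drop_records_with_n` are made up for illustration, and treating lowercase 'n' as an N is an assumption that was not stated in the thread.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_records_with_n(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield only records whose sequence has no N.

    Assumption: lowercase 'n' counts as N too (hence the upper()).
    Only the sequence is tested, never the header.
    """
    for header, seq in read_fasta(lines):
        if "N" not in seq.upper():
            yield header, seq
```

To use it on a file, iterate over `drop_records_with_n(open("in.fa"))` and print each pair as `>header` followed by the sequence. Records are processed one at a time, so memory use stays low even for very large files.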

0

It didn't look like any of those sequences were in danger of being multi-line. To be safe, you can convert multi-line fasta to single-line fasta:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}'

ADD REPLY • link 8.8 years ago by igor 13k
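
For anyone who would rather do the same multi-line to single-line conversion in Python, here is a rough equivalent sketch using only the standard library. The function name `linearize_fasta` is hypothetical. Unlike the awk one-liner, it does not emit a leading blank line before the first header.

```python
from typing import Iterable, Iterator


def linearize_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and each record's sequence as one line."""
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record's sequence
            if chunks:
                yield "".join(chunks)
                chunks = []
            yield line
        elif line:
            chunks.append(line)
    # Flush the sequence of the last record
    if chunks:
        yield "".join(chunks)
```

Each yielded string is one output line, so `print("\n".join(linearize_fasta(open("in.fa"))))` would give a single-line FASTA that line-based tools such as grep can then be applied to.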

0

8.8 years ago

Matt Shirley 10k

Using pyfaidx and awk:

ADD COMMENT • link 8.8 years ago by Matt Shirley 10k

score 4 · Accepted Answer · 2016-03-24

Python, using Biopython

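import sys
from Bio import SeqIO

# Read all records and keep only those whose sequence contains no 'N'.
# (The old "rU" open mode is no longer accepted by recent Python 3 versions.)
with open(sys.argv[1]) as handle:
    filtered = [record for record in SeqIO.parse(handle, "fasta") if record.seq.count('N') == 0]

# Write the surviving records to a new FASTA file
with open("N_removed.fasta", "w") as output_handle:
    SeqIO.write(filtered, output_handle, "fasta")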

Save script (e.g. removeNfromfas.py) and execute as python removeNfromfas.py <yourfile.fasta>

Updated version 19 months later:

import sys
from Bio import SeqIO
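for record in SeqIO.parse(sys.argv[1], "fasta"):
    # Records are streamed one at a time; print only those without 'N'
    if record.seq.count('N') == 0:
        # format() already ends with a newline, so suppress print's own
        print(record.format("fasta"), end="")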

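Save script (e.g. removeNfromfas.py) and execute as `python removeNfromfas.py yourfile.fasta > newfile.fasta`. This version is more flexible and can handle enormous files if necessary, because it processes one record at a time instead of loading the whole file, so its memory requirements are lower.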