Extract only sequences that have N nucleotides
5
0
Entering edit mode
9.0 years ago
waqasnayab ▴ 250

Hi,

I have a fasta file. From that fasta file, I need to extract only those sequences that have Ns nucleotides.

>seq1
AGCGGCGTAACGTCGTAGTC
>seq2
ACGCGTACNNNNNNTGCGA

I want output like this:

>seq1
AGCGGCGTAACGTCGTAGTC

Regards,
Waqas

sequencing genome • 3.5k views
ADD COMMENT
0
Entering edit mode

Your example has no Ns, which is the exact opposite of what you say you want.

Are the sequences always on a single line?

ADD REPLY
0
Entering edit mode

You mean, you want to exclude the sequences with Ns?

awk -v header="" '{ if($1~/^>/){header=$1}else if($1!~/N/){print header; print $1;}}' fastaFile

Note, this only work if you don't have multi-line fasta

ADD REPLY
1
Entering edit mode
9.0 years ago

linearize the fasta

filter with awk and a regular expression, convert back to fasta

awk -f linearizefasta.awk < input.fa | awk -F '\t' '($2 ~ /N/)' | tr "\t" "\n"
ADD COMMENT
1
Entering edit mode
9.0 years ago
Daniel ★ 4.0k

With one-line fasta and with sequence names specifically 'seq#' (i.e. no letter 'N's), you can super easily do:

grep -B 1 'N' input.fasta >output_no_Ns.fasta

But that is assuming your data is in the specific format above. Multi line fasta or N's in your headers would break it. But this is what I'd do if I had this data.

ADD COMMENT
1
Entering edit mode
9.0 years ago
waqasnayab ▴ 250

Hi,

First, I linearized multi-line fasta file to single-line fasta file:

Multiline Fasta To Single Line Fasta

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa

Than, @Daniel command:

grep -B 1 'N' input.fasta >output_no_Ns.fasta

I apologized the community for not explaining my question properly.

Best,
Waqas.

ADD COMMENT
1
Entering edit mode

Thanks for letting us know! You can upvote and tick the helping answers, to finish the question thread.

ADD REPLY
0
Entering edit mode
9.0 years ago
Anima Mundi ★ 2.9k

In Python, for a file named foo.fa containing linearised FASTAs (prunes FASTAs with Ns):

header = ''

for line in open('foo.fa'):
    if '>' in line:
        header = line
    elif line == '\n':
        pass
    elif 'N' not in line.upper():
        print header,
        print line,
ADD COMMENT
0
Entering edit mode
9.0 years ago

With the BBMap package:

reformat.sh in=file.fasta out=fixed.fasta maxns=0
ADD COMMENT

Login before adding your answer.

Traffic: 1467 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6