Extract only sequences that have N nucleotides

0

Entering edit mode

9.3 years ago

waqasnayab ▴ 250

Hi,

I have a fasta file. From that fasta file, I need to extract only those sequences that have Ns nucleotides.

>seq1
AGCGGCGTAACGTCGTAGTC
>seq2
ACGCGTACNNNNNNTGCGA

I want output like this:

>seq1
AGCGGCGTAACGTCGTAGTC

Regards,
Waqas

sequencing genome • 3.8k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.3 years ago by waqasnayab ▴ 250

0

Entering edit mode

Your example has no Ns, which is the exact opposite of what you say you want.

Are the sequences always on a single line?

ADD REPLY • link 9.3 years ago by Devon Ryan 105k

0

Entering edit mode

You mean, you want to exclude the sequences with Ns?

awk -v header="" '{ if($1~/^>/){header=$1}else if($1!~/N/){print header; print $1;}}' fastaFile

Note, this only work if you don't have multi-line fasta

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by Sam ★ 4.8k

1

Entering edit mode

9.3 years ago

Pierre Lindenbaum 166k

linearize the fasta

Linearize a fasta sequence

awk -f linearizefasta.awk < input.fa

or

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

Format back to fasta

tr "\t" "\n" < linearized.tsv

if you know your fasta header have a length < 60

tr "\t" "\n" < linearized.tsv | fold -w 60

view raw README.md hosted with ❤ by GitHub

	/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;}
	{printf("%s",$0);}
	END {printf("\n");}

view raw linearizefasta.awk hosted with ❤ by GitHub

filter with awk and a regular expression, convert back to fasta

awk -f linearizefasta.awk < input.fa | awk -F '\t' '($2 ~ /N/)' | tr "\t" "\n"

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

9.3 years ago

Daniel ★ 4.0k

With one-line fasta and with sequence names specifically 'seq#' (i.e. no letter 'N's), you can super easily do:

grep -B 1 'N' input.fasta >output_no_Ns.fasta

But that is assuming your data is in the specific format above. Multi line fasta or N's in your headers would break it. But this is what I'd do if I had this data.

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by Daniel ★ 4.0k

1

Entering edit mode

9.3 years ago

waqasnayab ▴ 250

Hi,

First, I linearized multi-line fasta file to single-line fasta file:

Multiline Fasta To Single Line Fasta

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa

Than, @Daniel command:

grep -B 1 'N' input.fasta >output_no_Ns.fasta

I apologized the community for not explaining my question properly.

Best,
Waqas.

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by waqasnayab ▴ 250

1

Entering edit mode

Thanks for letting us know! You can upvote and tick the helping answers, to finish the question thread.

ADD REPLY • link 9.3 years ago by Daniel ★ 4.0k

0

Entering edit mode

9.3 years ago

Anima Mundi ★ 2.9k

In Python, for a file named foo.fa containing linearised FASTAs (prunes FASTAs with Ns):

header = ''

for line in open('foo.fa'):
    if '>' in line:
        header = line
    elif line == '\n':
        pass
    elif 'N' not in line.upper():
        print header,
        print line,

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by Anima Mundi ★ 2.9k

0

Entering edit mode

9.3 years ago

Brian Bushnell 20k

With the BBMap package:

reformat.sh in=file.fasta out=fixed.fasta maxns=0

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by Brian Bushnell 20k

Login before adding your answer.