Question

Process to remove terminal Ns from fasta?

0

5.6 years ago

stacy734 ▴ 40

Can anyone recommend a tool or unix command line to remove terminal (leading or trailing) Ns from a fasta file?

Thanks in advance for any advice.

fasta • 5.0k views

ADD COMMENT • link updated 3.7 years ago by sullis02 ▴ 40 • written 5.6 years ago by stacy734 ▴ 40

0

Thank you everyone.

This request was to help with submitting genome assemblies to GenBank. They ask that terminal Ns (gaps) be removed from the ends of contigs. However, I have found that if you simply leave them in, GenBank removes them as part of its processing.

ADD REPLY • link 5.6 years ago by stacy734 ▴ 40

0

stacy734: While that may be the case, since you asked this question in the first place, could you please test the posted answers and accept any or all that work? This would benefit future users who find this thread by searching.

ADD REPLY • link 5.6 years ago by GenoMax 152k

0

In fact, leaving the terminal Ns in for GenBank to remove no longer works, so you need one of the other solutions.

ADD REPLY • link 4.4 years ago by Michael 56k

score 3 · Answer 1 · 2019-12-16

3

5.6 years ago

cpad0112 21k

See if this works with seqkit to remove trailing Ns (-i ignores case, -s applies the pattern to the sequence rather than the name):

$ seqkit -is replace -p "n+$" -r "" test.fa

To remove leading Ns as well (as mentioned in the OP), try the following:

$ seqkit -is replace -p "^n+|n+$" -r "" test.fa

Try this with sed to remove leading and trailing n (as written it matches lowercase n only; use [Nn] to match both cases):

$ sed -r '/^>/! s/n+$|^n+//g' test.fa

ADD COMMENT • link 5.6 years ago by cpad0112 21k

0

Your sed command runs the risk of removing internal Ns that occur immediately before or after line breaks in the sequence. To avoid that, you need to linearize the sequence first. It also leaves an undesirable empty line between the definition line and the sequence when a whole line consists of Ns.
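A small demonstration of both problems, using a made-up two-record file. The file names and contig names are assumptions for illustration, and the sed call is the per-line approach written with -E and uppercase N:

```shell
# Hypothetical wrapped FASTA. contig1 is really ACGTNNNNACGT (no terminal Ns at all);
# contig2 is really NNNNNNACGT, with its leading Ns filling the whole first line.
printf '>contig1\nACGTNN\nNNACGT\n>contig2\nNNNNNN\nACGT\n' > wrapped.fa

# Trim Ns at the start and end of every sequence line, one line at a time.
sed -E '/^>/! s/N+$|^N+//g' wrapped.fa > per_line.fa
cat per_line.fa
# contig1 comes out as ACGT / ACGT: the four internal Ns at the line break are gone.
# contig2 comes out as an empty line followed by ACGT.
```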

This revision avoids both problems. It linearizes each record (awk), converts it to a two-line FASTA record (tr), and removes the leading and trailing Ns (sed). I match N rather than n because my Ns are typically uppercase; substitute [Nn] to handle both. Finally, it rewraps (fold) the long sequence line at 80 nt. Make sure the width is longer than your longest definition line; otherwise fold will break those deflines at 80 characters too.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' my.fasta | tr "\t" "\n" | sed -r '/^>/! s/N+$|^N+//g' |  fold -w 80 >  my.Ntrimmed.fasta
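If you would rather not depend on fold leaving the deflines alone, the same steps can be sketched in a single awk program that wraps only the sequence. This is a sketch under assumptions, not a tested production tool: the function name trim_terminal_ns and its optional width argument are my own inventions, and it matches both N and n.

```shell
# trim_terminal_ns [width]: hypothetical helper, reads FASTA on stdin, writes to stdout.
# It joins each record into one string, trims leading/trailing N or n,
# then rewraps the sequence (never the defline) at the given width (default 80).
trim_terminal_ns() {
  awk -v width="${1:-80}" '
    function flush_record(   i, n) {
      if (header == "") return          # nothing buffered yet
      sub(/^[Nn]+/, "", seq)            # leading Ns of the whole sequence
      sub(/[Nn]+$/, "", seq)            # trailing Ns of the whole sequence
      print header
      n = length(seq)
      for (i = 1; i <= n; i += width) print substr(seq, i, width)
    }
    /^>/ { flush_record(); header = $0; seq = ""; next }
    { seq = seq $0 }                    # linearize: append each sequence line
    END { flush_record() }
  '
}
```

Usage would look like `trim_terminal_ns 80 < my.fasta > my.Ntrimmed.fasta`. A record that consists only of Ns ends up as a defline with no sequence, so you may want to check for those afterwards.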

ADD REPLY • link 3.7 years ago by sullis02 ▴ 40

score 0 · Answer 2 · 2019-12-15

0

5.6 years ago

Jianyu ▴ 580

Use seqkit subseq:

example:

seqkit subseq -r 6:-1 test.fa # remove the first 5 bases

ADD COMMENT • link 5.6 years ago by Jianyu ▴ 580

0

The OP does not want to remove a fixed number of bases but all terminal Ns, which I assume may vary in length. seqkit can probably do that, but this is likely not the correct command.

ADD REPLY • link 5.6 years ago by GenoMax 152k

1

I misunderstood the question. So you want to remove the terminal Ns, not ATCG bases? A quick thought:

awk '{if (/^>/) {print} else {sub(/^N+/, ""); sub(/N+$/, ""); print}}' test.fa

ADD REPLY • link 5.6 years ago by Jianyu ▴ 580

score 0 · Answer 3 · 2019-12-16

0

5.6 years ago

zubenel ▴ 120

Another option, using a Perl one-liner:

perl -pe 's/^([ACGT][ACGTN]+?)N+$|(^N+)/$1/gi' test.fa

Note that this command works only if the Ns are at the start or at the end of a line, not both. If a line has Ns on both sides, only the leading run is removed.

ADD COMMENT • link 5.6 years ago by zubenel ▴ 120