How to remove all-N sequence entries from fasta file.
4
0
Entering edit mode
3.6 years ago
Info.shi ▴ 30

I want to remove entries of a fasta file where all nucleotides are N, but not entries that contain ATGC and N nucleotides.

from an example-

>Seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>Seq_2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Seq_3
ATGGCCTTAGTACCAGCTACAAACAACACGNNNAGCNGCCGAC

Output file content to be:

>Seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>Seq_3
ATGGCCTTAGTACCAGCTACAAACAACACGNNNAGCNGCCGAC

Thank you!

awk perl • 2.9k views
ADD COMMENT
1
Entering edit mode

Two ways with grep (for flat fasta files) and cutadapt:

$ grep -i -B 1 --no-group-separator  '[ATGC]' test.fa

>Seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>Seq_3
ATGGCCTTAGTACCAGCTACAAACAACACGNNNAGCNGCCGAC



$ cutadapt --quiet --max-n 0.99 test.fa

>Seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>Seq_3
ATGGCCTTAGTACCAGCTACAAACAACACGNNNAGCNGCCGAC
ADD REPLY
1
Entering edit mode
3.6 years ago
GenoMax 147k

Use reformat.sh from BBMap suite. Set NN to the a high enough number.

reformat.sh -Xmx2g in=your.fa out=clean.fa maxns=NN
ADD COMMENT
1
Entering edit mode
3.6 years ago
ATpoint 85k
cat foo.fa
>chrA
NNNNNNNN
>chrB
TAGCTACGNNATCGNNN
>chrC
TAGCTGACATCG

samtools faidx foo.fa
seqtk comp foo.fa \
| awk '{c=0;for(i=3;i<=6;++i){c+=$i}; gsub(">",""); if(c>0) print $1}' \
| xargs samtools faidx foo.fa

>chrB
TAGCTACGNNATCGNNN
>chrC
TAGCTGACATCG

Explanation: seqtk comp determines the number of A/T/C/G in each sequence (that is the columns 3-6 of its output). awk then will print those chromosome names where the sum of these are zero (so where only non A/T/C/G where in the sequence), and samtools faidx then fetches from the fasta those entires that awk returned. The initial faidx command is used to index the fasta file.

ADD COMMENT
1
Entering edit mode
3.6 years ago

linearize, filter and reformat with awk

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '!($2 ~ /^[N]*$/) {printf("%s\n%s\n",$1,$2);}'
ADD COMMENT
1
Entering edit mode
3.6 years ago
Juke34 8.9k

You can run gaas_fasta_purify.pl from GAAS

It will

  • Remove pure Ns sequences
  • lowercase nucleotides: There are considered as repeat by annotation tool, while most assemblers consider them as region with low coverage. Fixed by the tool.
  • Leading and trailing Ns sequences are forbidden for submission to ENA. Trimmed by the tool.
  • Long internal N regions (> 10000) is problematic in order to run Genemark. The tool inform the user.
  • IUPAC code might be problematic for some tools. e.g, an extra option is mandatory for MAKER. The tool inform the user.
  • Short sequences can be useless and time consuming for annotation tools. The tool remove them. Default 1000 bp.
  • Long sequence in single line fasta might raise issues some tools. The tool fold them (80 characters).
ADD COMMENT

Login before adding your answer.

Traffic: 2482 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6