How to remove Bad Nucleotides represented by "N" from Fasta file by using UNIX? Thanks in advance
If you don't care about changing the coordinates then sed 's/N//g' your_fasta > new.fa
should do it. Note: This sledgehammer solution will remove any N's that may be in the fasta headers too.
Edit (2019): The following solution avoids the white space that the solution above generates. The fasta-linearizing code (first part) is courtesy of @Pierre.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < your_file | sed 's/N//g' | tr "\t" "\n" > new_file
If you need NN nucleotides per line (instead of a single line), then fold your file like this: fold -w NN your_file > new_file
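One thing to keep in mind is that fold wraps every line, so a header longer than NN characters would be wrapped too. A minimal Python sketch of the same re-wrapping that leaves header lines alone is shown here; the function name wrap_fasta and the default width of 60 are illustrative assumptions, not something from the answer above.

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines with sequence lines wrapped to `width` characters.

    Header lines (starting with '>') are passed through unchanged.
    Assumes each sequence is already on a single line (linearized).
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        else:
            # Slice the sequence into chunks of at most `width` characters.
            for start in range(0, len(line), width):
                yield line[start:start + width]


if __name__ == "__main__":
    example = [">a_header_that_is_longer_than_the_width", "ACGTACGTAC"]
    for out in wrap_fasta(example, width=4):
        print(out)
```

To use it on a real file, pass an open file handle instead of the example list.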
Hi Genomax, I used sed 's/N//g' your_fasta > new.fa to remove all 'N' from a fasta sequence. It worked, but now there is white space where the N's were. Can you tell me how to get rid of these spaces as well? This is how it looks now.
Thank you.
>DAT1-COMP102480-C1-SEQ1-1788-1
ATGGCGCCACATGAACTCCGGCGTACTTTTAAGCGCACGGCAATCTCGGA
TCAACAACGGCGAAGAGATATCGCGCTTCTACGGCAGAACCAGCAGCGTT
CCGACTCACAGAATCGTGCCCGCCGCCTCGCCTCTTCTGTCCTCGCCATT
CCCGACCACTCATCTCCGGCCGAAGCCCAAGTCGACCTCCCCGACGTCGT
AGATGTCCATACCGATTTGTATCTGGATCATTCTTCGGAGCCGGAGGCCG
CTTCTCCTGCAGGCAGACAGTTGGATGTGGTCGAAGCCTCAGATTTGAAG
GGCTGGACGGCCCGCCACTGGTTCTCCCGCCAGCTTATGCTCATGGAATG
GATGATTGACGTGCCTCCTAGCCTCGATCGCGATTG
GTACGTCTTTGCAAGACC
TTCTGGTAAGCGCTGCTTTGTTGTTTCTTCAAATGGTACCACAGTGAGCA
GGCTTCGTAATGGCTCTGTTTTGCATCGTTTTCCATCTTCCTTACCTAAT
GGCGCTAGGACAAAAGAAATATCAGCTCCATCACATGTTTTTTGTATACT
TGATTGCATTTTTCATGAG
CCTGATCAGACATTTTATGTGATTGATATGATTTGT
TGGCGAGGATACTCATTATATGATTGTAGTGCGGAGTTCAGATTTTTTTG
GTTGAACTCAAAGCTTTTGGAGACTGGAGCCTGTGATCTTCCTTCAGTAT
ACCATAGGTATAGATTCAGTGTTGTACCTGCTTATGAATGTAACCAGATA
GGCTTGCAAAAAGCATATACGTGTGGAGTGACGTTTGTTAAAGATGGCCT
ATTGTTCTACAACAG
GCATGCAAATTAT
CAGGCTGGGAATACTCCATTAGCACTAGTATGGAAGGATGAATTTTGTAG
CAAATATGTTTTGGATACAGACGGTGAAGGACAGGTTCCAATACAACAAC
A
GGTTGTCTTGGAGTTG
CAAGGTACTGGGAAGTTGATTACACATGATGATCCTCCAATTGTATTTGG
CTGCTTGGAGAGAGATTTCCTTCAAAAG
TCGG
GTTTGCAAGTTGGAAATCTTCTTCGGTTTTCCATCGTGAATGAAAGCGCG
AGGATAGTTGATGGCAAGCTGGAGTTGGGAGAGATTAAATTTCTCGGCAA
AGCAAACCGTTTTCGAGCTTTTGCAGATAGCTACTCAAAG
GTATTCTTCCAGCACAC
GGCCCGTCACTCTCCTCTTCAATTCATGGATCTGATGGTATCCGTGGATC
CGAGT
sed -e '/^[^>]/s/[^ATGCatgc]/N/g' file.fa
Dear @The Bright Star, Hi and welcome to Biostars
Have you checked the FASTA cleanup by Pierre Lindenbaum in Biostars?
And also "How to remove N from fasta sequences"?
Same caveat as genomax, this will wreck your headers if they contain the 'N' string, but this is about as simple as it gets:
tr -d 'N' < seqs.fasta
If you need to preserve the headers, then we'll have to get a bit more inventive with Biopython or some other proper parser.
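For what it's worth, preserving the headers does not strictly need a full parser. A minimal sketch in plain Python, assuming only uppercase N should be removed and that any line starting with ">" is a header, could look like this; the function name strip_n is hypothetical.

```python
def strip_n(lines):
    """Yield FASTA lines with 'N' removed from sequence lines only."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: leave it untouched, even if it contains 'N'.
            yield line
        else:
            cleaned = line.replace("N", "")
            # Skip sequence lines that consisted only of N's, so no
            # blank lines are left behind in the output.
            if cleaned:
                yield cleaned


if __name__ == "__main__":
    example = [">seq1 with N in the header", "NNACGTNNACGT", "NNNN", "ACGN"]
    for out in strip_n(example):
        print(out)
```

Because all-N lines are dropped rather than emptied, this also avoids the short or blank lines reported earlier in the thread, although line lengths will no longer be uniform.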
>tpg|Magnaporthiopsis_incrustans|JF414846
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGCNNNNNN
>tpg|Pyricularia_pennisetigena|AB818016
NNNNNNGCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>tpg|Inocybe_sororia|EU525947
NNNAACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGCNNN
>tpg|Magnaporthiopsis_incrustans|JF414846
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGC
>tpg|Pyricularia_pennisetigena|AB818016
GCAAGTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>tpg|Inocybe_sororia|EU525947
AACCAGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC
Linearize your fasta if it is multi-line.
Trim the leading and trailing N's:
#!/usr/bin/env python
import sys

# Assumes a linearized fasta: every header line is followed by one sequence line.
with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith(">"):
            header = line.strip()
            seq = next(f).strip()
            # strip('N') removes only the terminal N's; internal N's are kept
            trim = seq.strip('N')
            print(header)
            print(trim)
Save as trim_N.py, run as python trim_N.py input.fasta > output.fasta
Example input/output:
>1
NNNNNNNNNGGGAGGTGTTTTGGTCCTTGATCCTATTGCCTACGGCAGCCGCTGGATTGTTATTACTCGCGGCCCAGCCGGCCATGGCCCAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTGGTGAAGCNNNNNNNNNNN
>1
GGGAGGTGTTTTGGTCCTTGATCCTATTGCCTACGGCAGCCGCTGGATTGTTATTACTCGCGGCCCAGCCGGCCATGGCCCAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTGGTGAAGC
Alternatively, if you also want to remove N's that are in the middle of the sequence:
#!/usr/bin/env python
import sys

# Assumes a linearized fasta: every header line is followed by one sequence line.
with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith(">"):
            header = line.strip()
            seq = next(f).strip()
            # remove every N, wherever it occurs in the sequence
            trim = seq.replace('N', '')
            print(header)
            print(trim)
Save and run same as above.
Example input/output:
>1
NNNNNNNNNGGGAGGTGTTTTGGTCCTTGATCCTATTGCCTACGGCNNNNNAGCCGCTGGATTGTTATTACTCNNNNNGCGGCCCAGCCGGCCATGGCCCAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTGGTGAAGCNNNNNNNNNNN
>1
GGGAGGTGTTTTGGTCCTTGATCCTATTGCCTACGGCAGCCGCTGGATTGTTATTACTCGCGGCCCAGCCGGCCATGGCCCAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTGGTGAAGC
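Both scripts above rely on the fasta being linearized first. As a rough sketch of how that step could be folded in, the reader below joins multi-line records before trimming; the names read_fasta and trim_terminal_n are hypothetical, and the sketch assumes every record starts with a ">" header line.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_terminal_n(seq):
    """Remove leading and trailing N's, keeping any internal N's."""
    return seq.strip("N")


if __name__ == "__main__":
    example = [">1", "NNNACG", "TNNACG", "TNN"]
    for header, seq in read_fasta(example):
        print(header)
        print(trim_terminal_n(seq))
```

Replacing trim_terminal_n(seq) with seq.replace("N", "") would give the remove-everything variant instead.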
Not a pure UNIX solution, but the SeqBuddy --replace_subseq command can clean up any arbitrary sequence pattern from any standard sequence or alignment format.
$: seqbuddy.py <input file> --replace_subseq 'N'
Sample input:
>Bca-PanxA Random meta data with NNNNs in it
NNGGACATTTTAAGNGTCGTCACTCGTTTCCCTATACTAGNGTTNGGNGTAGAACGTCAC
GANGANGACTTNGCAGACAGAATAAACTACAAGTATACGG
>Pae-PanxB
ANGTTNGACGTCTTNGGATCNGTAAAGGGCCTACTAAAACTNGACAGCGNGNGCATAGAT
AATAACGTATTCCGGCTTCATTATAAAGCTACNGTAATAA
Result:
>Bca-PanxA Random meta data with NNNNs in it
GGACATTTTAAGGTCGTCACTCGTTTCCCTATACTAGGTTGGGTAGAACGTCACGAGAGA
CTTGCAGACAGAATAAACTACAAGTATACGG
>Pae-PanxB
AGTTGACGTCTTGGATCGTAAAGGGCCTACTAAAACTGACAGCGGGCATAGATAATAACG
TATTCCGGCTTCATTATAAAGCTACGTAATAA
If your aim is to remove sequences with more than a particular percentage or number of Ns, then you can try Prinseq-lite with the -ns_max_p and -ns_max_n options, respectively.
It can also remove leading and trailing Ns with the -trim_ns_left and -trim_ns_right options.
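To make the idea of filtering by N content concrete, here is a small Python sketch. It is not Prinseq's implementation: the function name keep_record and its parameters are hypothetical, and treating the thresholds as "drop only when strictly above the limit" is an assumption that may differ from how Prinseq-lite handles the boundary.

```python
def keep_record(seq, max_n_percent=None, max_n_count=None):
    """Return True if `seq` passes the N-content limits.

    A sequence is rejected when its number of N's exceeds `max_n_count`
    or its percentage of N's exceeds `max_n_percent` (assumed semantics).
    """
    n_count = seq.upper().count("N")
    if max_n_count is not None and n_count > max_n_count:
        return False
    if max_n_percent is not None and seq:
        if 100.0 * n_count / len(seq) > max_n_percent:
            return False
    return True


if __name__ == "__main__":
    records = [(">a", "ACGTACGT"), (">b", "NNNNACGT"), (">c", "ACGNACGT")]
    for header, seq in records:
        if keep_record(seq, max_n_percent=20):
            print(header)
            print(seq)
```

Filtering whole sequences this way keeps the coordinates of the surviving sequences intact, which removing internal N's does not.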
Do you need to trim leading/trailing N's?
Yes
Hi, I have been trying to use the supplied command to remove the "N"s, but FastQC will not accept the output file. Could you help me fix this? The command I used was this:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < elimination_n | sed 's/N//g' | tr "\t" "\n" > elimination1
Thanks!
The solution below only works for fasta format files. What format is your file in? If you have fastq format data then this solution will not work. If you want, show us the output of head -4 elimination_n
My files are in fastq format :( . Thank you very much for your help. Is there a command that allows me to remove them from the fastq format?
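The reason the fasta command breaks fastq files is that every base has a matching quality character, so deleting an N from the sequence line alone leaves the two lines with different lengths. A minimal Python sketch that removes each N together with its quality character is shown here; it assumes plain four-line fastq records, and the function names are hypothetical. Trimming or filtering reads is often a gentler option than deleting internal bases, so treat this as an illustration rather than a recommendation.

```python
def drop_n_from_read(seq, qual):
    """Remove each 'N' from `seq` along with its quality character."""
    kept = [(base, q) for base, q in zip(seq, qual) if base != "N"]
    new_seq = "".join(base for base, _ in kept)
    new_qual = "".join(q for _, q in kept)
    return new_seq, new_qual


def clean_fastq(lines):
    """Yield cleaned fastq lines; assumes exactly four lines per record."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        seq, qual = drop_n_from_read(seq, qual)
        yield header
        yield seq
        yield plus
        yield qual


if __name__ == "__main__":
    example = ["@read1", "ANCG", "+", "IIH#"]
    for out in clean_fastq(example):
        print(out)
```

Note that a read made only of N's would come out empty, which some downstream tools may reject.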
I removed the first 20 bases from my reads, which decreased the "N" content. Is it still necessary to remove those miscalled bases from my reads?