Question

Remove Invalid Nucleotides From Fasta File

2

Entering edit mode

12.6 years ago

Justin ▴ 470

Fasta files typically contain invalid characters such as "NNNN", how can I remove those and then make sure I still have e.g. a 60 nucleotides per line?

I feel like there's probably a package that does this already.

fasta nucleotide • 9.2k views

ADD COMMENT • link updated 12.6 years ago by Flashton • 0 • written 12.6 years ago by Justin ▴ 470

2

Entering edit mode

Others have pointed out that "N" is not invalid; it means "base not known". I'll point it out again, because it's important.

ADD REPLY • link 12.6 years ago by Neilfws 49k

score 4 · Answer 1 · 2012-07-03

4

Entering edit mode

12.6 years ago

Leonor Palmeira 3.9k

You could this in R with the seqinR package:

library(seqinr)
x=read.fasta("myoriginal.fasta")
new=lapply(seq(length(x)), function(i) {
      s2c(gsub("N","",c2s(getSequence(x[[i]]))))
})
write.fasta(new,names=names(x),file="newfile.fasta",nbchar=60)

But the N characters do have a meaning, they are not 'invalid nucleotides'. They mean that this base is not fully determined in the sequence you have, so you should make sure taking them out is really what you want to do...

ADD COMMENT • link 12.6 years ago by Leonor Palmeira 3.9k

4

Entering edit mode

Commenting to re-emphasize this point: "you should make sure taking them out is really what you want to do"

ADD REPLY • link 12.6 years ago by Chris Miller 22k

score 4 · Answer 2 · 2012-07-03

try this kinda ugly python script: (need biopython)

import sys
from Bio import SeqIO

faFile = open('your file.fa','r')

for record in SeqIO.parse(faFile,'fasta'):
  filteredSeq = str(record.seq).lower().replace('n','')
  outputLines = [filteredSeq[i:i+60] for i in range(0, len(filteredSeq), 60)]

  print ">" + record.id
  for outputLine in outputLines:
    print outputLine

score 2 · Answer 3 · 2012-07-03

this simple C program should discard all the non ATGC nucleotides:

#include <stdio.h>
int main(int argc,char** argv)
    {
    int c;
    int N=0;
    do
        {
        c=fgetc(stdin);
        switch(c)
            {
            case EOF:
            case '>':
                {
                if(N!=0) fputc('\n',stdout);
                if(c==EOF) break;
                fputc(c,stdout);
                while((c=fgetc(stdin))!=EOF && c!='\n')
                    {
                    fputc(c,stdout);
                    }
                fputc('\n',stdout);
                N=0;
                break;
                }
            case 'A': case 'a':
            case 'T': case 't':
            case 'G': case 'g':
            case 'C': case 'c':
                {
                if(N==60) { fputc('\n',stdout); N=0;}
                fputc(c,stdout);
                ++N;
                break;
                }
            default:break;
            }
        } while(c!=EOF);
    return 0;
    }

score 0 · Answer 4 · 2012-10-24

A colleague helped me do this in quite a neat way using regular expressions in python (biopython required).

from Bio import SeqIO
import re

handle = open("/Users/phil/Documents/Infernal/NCBI_nr_clean.fasta", "rU")

sequences={}

for record in SeqIO.parse(handle, "fasta") :
    #replace any non-GATC characters in your sequences with nothing
    sequence = re.sub('[^GATC]', "", str(record.seq).upper())
    #discard sequences that are less than 60 bp long
    if (len(sequence))>=60:
        sequences[sequence]=record.id

output_file=open("output.fasta","w+")

for sequence in sequences:
    output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
output_file.close()
handle.close()