Remove Invalid Nucleotides From Fasta File
4
2
Entering edit mode
12.4 years ago
Justin ▴ 470

Fasta files typically contain invalid characters such as "NNNN", how can I remove those and then make sure I still have e.g. a 60 nucleotides per line?

I feel like there's probably a package that does this already.

fasta nucleotide • 9.1k views
ADD COMMENT
2
Entering edit mode

Others have pointed out that "N" is not invalid; it means "base not known". I'll point it out again, because it's important.

ADD REPLY
4
Entering edit mode
12.4 years ago

You could this in R with the seqinR package:

library(seqinr)
x=read.fasta("myoriginal.fasta")
new=lapply(seq(length(x)), function(i) {
      s2c(gsub("N","",c2s(getSequence(x[[i]]))))
})
write.fasta(new,names=names(x),file="newfile.fasta",nbchar=60)

But the N characters do have a meaning, they are not 'invalid nucleotides'. They mean that this base is not fully determined in the sequence you have, so you should make sure taking them out is really what you want to do...

ADD COMMENT
4
Entering edit mode

Commenting to re-emphasize this point: "you should make sure taking them out is really what you want to do"

ADD REPLY
4
Entering edit mode
12.4 years ago

try this kinda ugly python script: (need biopython)

import sys
from Bio import SeqIO

faFile = open('your file.fa','r')

for record in SeqIO.parse(faFile,'fasta'):
  filteredSeq = str(record.seq).lower().replace('n','')
  outputLines = [filteredSeq[i:i+60] for i in range(0, len(filteredSeq), 60)]

  print ">" + record.id
  for outputLine in outputLines:
    print outputLine
ADD COMMENT
2
Entering edit mode
12.4 years ago

this simple C program should discard all the non ATGC nucleotides:

#include <stdio.h>
int main(int argc,char** argv)
    {
    int c;
    int N=0;
    do
        {
        c=fgetc(stdin);
        switch(c)
            {
            case EOF:
            case '>':
                {
                if(N!=0) fputc('\n',stdout);
                if(c==EOF) break;
                fputc(c,stdout);
                while((c=fgetc(stdin))!=EOF && c!='\n')
                    {
                    fputc(c,stdout);
                    }
                fputc('\n',stdout);
                N=0;
                break;
                }
            case 'A': case 'a':
            case 'T': case 't':
            case 'G': case 'g':
            case 'C': case 'c':
                {
                if(N==60) { fputc('\n',stdout); N=0;}
                fputc(c,stdout);
                ++N;
                break;
                }
            default:break;
            }
        } while(c!=EOF);
    return 0;
    }
ADD COMMENT
0
Entering edit mode
12.1 years ago
Flashton • 0

A colleague helped me do this in quite a neat way using regular expressions in python (biopython required).

from Bio import SeqIO
import re

handle = open("/Users/phil/Documents/Infernal/NCBI_nr_clean.fasta", "rU")

sequences={}

for record in SeqIO.parse(handle, "fasta") :
    #replace any non-GATC characters in your sequences with nothing
    sequence = re.sub('[^GATC]', "", str(record.seq).upper())
    #discard sequences that are less than 60 bp long
    if (len(sequence))>=60:
        sequences[sequence]=record.id

output_file=open("output.fasta","w+")

for sequence in sequences:
    output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
output_file.close()
handle.close()
ADD COMMENT

Login before adding your answer.

Traffic: 1692 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6