Fasta files typically contain invalid characters such as "NNNN", how can I remove those and then make sure I still have e.g. a 60 nucleotides per line?
I feel like there's probably a package that does this already.
Fasta files typically contain invalid characters such as "NNNN", how can I remove those and then make sure I still have e.g. a 60 nucleotides per line?
I feel like there's probably a package that does this already.
You could this in R with the seqinR package:
library(seqinr)
x=read.fasta("myoriginal.fasta")
new=lapply(seq(length(x)), function(i) {
s2c(gsub("N","",c2s(getSequence(x[[i]]))))
})
write.fasta(new,names=names(x),file="newfile.fasta",nbchar=60)
But the N characters do have a meaning, they are not 'invalid nucleotides'. They mean that this base is not fully determined in the sequence you have, so you should make sure taking them out is really what you want to do...
try this kinda ugly python script: (need biopython)
import sys
from Bio import SeqIO
faFile = open('your file.fa','r')
for record in SeqIO.parse(faFile,'fasta'):
filteredSeq = str(record.seq).lower().replace('n','')
outputLines = [filteredSeq[i:i+60] for i in range(0, len(filteredSeq), 60)]
print ">" + record.id
for outputLine in outputLines:
print outputLine
this simple C program should discard all the non ATGC nucleotides:
#include <stdio.h>
int main(int argc,char** argv)
{
int c;
int N=0;
do
{
c=fgetc(stdin);
switch(c)
{
case EOF:
case '>':
{
if(N!=0) fputc('\n',stdout);
if(c==EOF) break;
fputc(c,stdout);
while((c=fgetc(stdin))!=EOF && c!='\n')
{
fputc(c,stdout);
}
fputc('\n',stdout);
N=0;
break;
}
case 'A': case 'a':
case 'T': case 't':
case 'G': case 'g':
case 'C': case 'c':
{
if(N==60) { fputc('\n',stdout); N=0;}
fputc(c,stdout);
++N;
break;
}
default:break;
}
} while(c!=EOF);
return 0;
}
A colleague helped me do this in quite a neat way using regular expressions in python (biopython required).
from Bio import SeqIO
import re
handle = open("/Users/phil/Documents/Infernal/NCBI_nr_clean.fasta", "rU")
sequences={}
for record in SeqIO.parse(handle, "fasta") :
#replace any non-GATC characters in your sequences with nothing
sequence = re.sub('[^GATC]', "", str(record.seq).upper())
#discard sequences that are less than 60 bp long
if (len(sequence))>=60:
sequences[sequence]=record.id
output_file=open("output.fasta","w+")
for sequence in sequences:
output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
output_file.close()
handle.close()
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Others have pointed out that "N" is not invalid; it means "base not known". I'll point it out again, because it's important.