Hi,
I have a fasta file. From that fasta file, I need to extract only those sequences that have Ns nucleotides.
>seq1
AGCGGCGTAACGTCGTAGTC
>seq2
ACGCGTACNNNNNNTGCGA
I want output like this:
>seq1
AGCGGCGTAACGTCGTAGTC
Regards,
Waqas
Hi,
I have a fasta file. From that fasta file, I need to extract only those sequences that have Ns nucleotides.
>seq1
AGCGGCGTAACGTCGTAGTC
>seq2
ACGCGTACNNNNNNTGCGA
I want output like this:
>seq1
AGCGGCGTAACGTCGTAGTC
Regards,
Waqas
linearize the fasta
filter with awk and a regular expression, convert back to fasta
awk -f linearizefasta.awk < input.fa | awk -F '\t' '($2 ~ /N/)' | tr "\t" "\n"
With one-line fasta and with sequence names specifically 'seq#' (i.e. no letter 'N's), you can super easily do:
grep -B 1 'N' input.fasta >output_no_Ns.fasta
But that is assuming your data is in the specific format above. Multi line fasta or N's in your headers would break it. But this is what I'd do if I had this data.
Hi,
First, I linearized multi-line fasta file to single-line fasta file:
Multiline Fasta To Single Line Fasta
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa
Than, @Daniel command:
grep -B 1 'N' input.fasta >output_no_Ns.fasta
I apologized the community for not explaining my question properly.
Best,
Waqas.
In Python, for a file named foo.fa containing linearised FASTAs (prunes FASTAs with Ns):
header = ''
for line in open('foo.fa'):
if '>' in line:
header = line
elif line == '\n':
pass
elif 'N' not in line.upper():
print header,
print line,
With the BBMap package:
reformat.sh in=file.fasta out=fixed.fasta maxns=0
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Your example has no Ns, which is the exact opposite of what you say you want.
Are the sequences always on a single line?
You mean, you want to exclude the sequences with Ns?
Note, this only work if you don't have multi-line fasta