Hi everyone,
I have a giant FASTA file, but some of the sequences have Ns in them:
GeneID:107003026
AAATTTACTTGTCCTTGTGAT
GeneID:107005138
TATGCACNNNGGTTGC
GeneID:107004481
GATTTTATGTTGCTGAA
So the second one has Ns in it. What can I do to get rid of the whole sequence so that the output looks like this? Thank you very much.
GeneID:107003026
AAATTTACTTGTCCTTGTGAT
GeneID:107004481
GATTTTATGTTGCTGAA
Did you search for similar posts on biostars.org? What did you find? What have you tried?
There are many ways to do this correctly, but are you sure you want to? What is your rationale?
I am analyzing promoter sequences for two closely related species. I need to align them and see how similar they are, so I may need to use BLAST, but it gives me an error when I try that in R on sequences that contain Ns. So I guess I will just have to ignore the sequences that have Ns.
That doesn't look like a FASTA file (no ">").
Yeah, there are '>'s in my file; they just disappeared when I posted the sequences here for some reason.
If you want to use only the default Unix tools, you can use grep to filter out the lines that contain Ns (assuming your sequence names do not contain N):
Then filter out empty records (where the sequence was removed by grep):
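The commands themselves are not shown in this post. As a rough illustration of the same two-step idea, here is a minimal Python sketch, assuming single-line FASTA with '>' headers. The function names drop_lines_with_n and drop_empty_records are made up for this example, and the records are the ones from the question with the '>' restored.

```python
def drop_lines_with_n(lines):
    # Step 1: like an inverted grep, drop every line that contains an uppercase N.
    # This does not look at what kind of line it is, which is why the
    # sequence names must not contain N.
    return [line for line in lines if "N" not in line]


def drop_empty_records(lines):
    # Step 2: drop headers that are no longer followed by a sequence line.
    kept = []
    for i, line in enumerate(lines):
        if line.startswith(">"):
            has_seq = i + 1 < len(lines) and not lines[i + 1].startswith(">")
            if not has_seq:
                continue
        kept.append(line)
    return kept


# Records from the question, with the '>' restored (assumption).
records = [
    ">GeneID:107003026", "AAATTTACTTGTCCTTGTGAT",
    ">GeneID:107005138", "TATGCACNNNGGTTGC",
    ">GeneID:107004481", "GATTTTATGTTGCTGAA",
]

filtered = drop_empty_records(drop_lines_with_n(records))
print("\n".join(filtered))
```

Like the grep version, this only works when every sequence sits on a single line.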
I do not recommend this, as it is unsafe. A good solution should handle all possible FASTA variants, whether they are multi-line, contain 'N' in headers, etc.
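To make that concrete, here is a minimal sketch of a record-based filter in plain Python. It is one possible approach, not a definitive implementation: it groups lines into (header, sequence) records first, so multi-line sequences are joined and an 'N' in a header is ignored. The names read_fasta and without_n, and the "N-rich region" description in the example header, are invented for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining the lines of multi-line records."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def without_n(records):
    """Keep only records whose sequence contains no N (case-insensitive)."""
    for header, seq in records:
        if "N" not in seq.upper():
            yield header, seq


# Hypothetical input: an 'N' in the first header and multi-line sequences.
example = """>GeneID:107003026 N-rich region
AAATTTACTT
GTCCTTGTGAT
>GeneID:107005138
TATGCAC
NNNGGTTGC
>GeneID:107004481
GATTTTATGTTGCTGAA
""".splitlines()

kept = list(without_n(read_fasta(example)))
for header, seq in kept:
    print(header)
    print(seq)
```

Only the sequence is tested for N, so the first record survives even though its header contains one. The output is written as single-line FASTA.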
It didn't look like any of those sequences were in danger of being multi-line. To be safe, you can convert multi-line FASTA to single-line FASTA:
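The original command is not shown here. As one way to do the conversion, here is a minimal Python sketch, assuming headers start with '>'. The function name linearize is made up for this example, and the first sequence from the question is split across two lines for illustration.

```python
def linearize(lines):
    """Return FASTA lines with each record's sequence joined onto one line."""
    out = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            out.append(line)   # a header always starts a new line
        elif out and not out[-1].startswith(">"):
            out[-1] += line    # continuation of the current sequence
        else:
            out.append(line)   # first sequence line after a header
    return out


# Sequences from the question; the first is split in two (assumption).
multi = [
    ">GeneID:107003026",
    "AAATTTACTT",
    "GTCCTTGTGAT",
    ">GeneID:107004481",
    "GATTTTATGTTGCTGAA",
]

single = linearize(multi)
print("\n".join(single))
```

After this step each record is exactly one header line followed by one sequence line, which is the layout a line-based filter relies on.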