I have 5000 protein sequences in one multi-FASTA file. I found that many of the sequences contain gaps, written as X. I want to remove those sequences completely (the whole protein sequence) from the file. My filter criterion is that a sequence containing more than 2 X (consecutive or anywhere in the sequence) should be removed. Thanks in advance for your help.
The input sequences look like this:
>Prot1
ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot2
ANSTVKKKKLLLYYYSSSEERXXXXXXXXXXXFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot4
ANSTVKKKKLLLYYYSSSEEXFGHYFGHYFGHFYVXXFGFYVHCEDYHF
I want output like this:
>Prot1
ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
It seems to me that you keep asking similar questions without making any effort to solve them on your own. Here you asked how to remove sequences containing `N` characters. Well, removing sequences with Xs is the same as removing Ns: you only change one letter in the script. You should have plenty of material, with all the scripts others have written for you, to solve this kind of problem with minimal effort. Most people here are helpful and kind enough to do this for you, but you will help yourself in the long run if you actually learn how to do it. You know that quote about teaching a man to fish?
Thanks for the advice.
I still didn't get a solution for this, which is why I started a new thread. I tried googling as well, but couldn't find anything. If I were better at scripting, I could have done it myself. Taking ideas from others for the first time and learning from them is also fishing.
For example, this solution from that thread:
Change `N` to `X`, and `==0` to `< 2`, and there is your solution. This is at least your fourth fishing lesson.
When I look at that thread, I see at least three solutions. Maybe they are not to your exact liking, but it isn't true that you didn't get any solutions.
This type of task is a good learning opportunity. It could be done in about 5 or 6 lines of Python, for instance.
What have you attempted so far?
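For anyone landing here later, the Python approach mentioned above might look roughly like the sketch below. It is not one of the solutions from the earlier thread. It assumes a plain FASTA file where every header line starts with `>`, and it reads the question's criterion as "keep a sequence if it has at most 2 X". The names `read_fasta`, `filter_by_x`, `MAX_X` and the script name `filter_x.py` are made up for illustration.

```python
import sys

MAX_X = 2  # assumption: keep records with at most this many X residues


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_x(lines, max_x=MAX_X):
    """Return the records whose sequence has no more than max_x X characters."""
    return [(h, s) for h, s in read_fasta(lines) if s.upper().count("X") <= max_x]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python filter_x.py input.fasta > filtered.fasta
    with open(sys.argv[1]) as handle:
        for header, seq in filter_by_x(handle):
            print(header)
            print(seq)
```

Only the sequence is counted, so an X in a header does not matter. Each kept sequence is written back on a single line, so any line wrapping in the input is lost. With the four example records from the question, this keeps Prot1 (one X) and Prot3 (no X) and drops Prot2 and Prot4.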
If the FASTA headers don't contain X, this should work with seqkit: