How to remove sequence reads contains more than 1 X from multifasta file?
2
0
Entering edit mode
4.6 years ago
k.kathirvel93 ▴ 310

I have 5000 protein sequences in one multifasta file. I found more reads have gaps as X in their reads. So, want to eliminate those reads completely (Whole protein seq) from the file. I am keeping filter criteria as if a read contains morethan 2 X ( continuesly or anywhere in the read) should be removed. Thanks in advance for your help.

The input sequence looks like this

>Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot2
ANSTVKKKKLLLYYYSSSEERXXXXXXXXXXXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot4
ANSTVKKKKLLLYYYSSSEEXFGHYFGHYFGHFYVXXFGFYVHCEDYHF

I want output Like this

 >Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
 >Prot3  
 ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
sequence sequencing • 2.9k views
ADD COMMENT
2
Entering edit mode

It seems to me that you keep asking similar question without making any effort to solve any of them on your own. Here you asked how to remove sequences containing N characters. Well, removing sequences with Xs is the same as removing Ns - you only change one letter in the script. You should have plenty of material with all the scripts others have written for you to solve this kind of problem with minimal effort. Most people here are helpful and kind enough to do this for you, but you will help yourself in the long run if you actually learn how to do it. You know that quote about teaching a man to fish?

enter image description here

ADD REPLY
0
Entering edit mode

Thanks for the advice.

Here you asked how to remove sequences containing N characters ?

Still i didn't get any solution for this, which is why i came up with a new thread. i tried with googlong also, but i couldn't get any. If i am better in scripting i could've done by myself. Taking ideas for the first time from others and learning from that is also fishing.

ADD REPLY
1
Entering edit mode

For example, this solution from that thread:

import sys
from Bio import SeqIO
for record in SeqIO.parse(sys.argv[1], "fasta"):
    if record.seq.count('N') == 0:
        print(record.format("fasta"))

Change N to X, and ==0 to < 2 and there is your solution. This is at least your fourth fishing lesson.

ADD REPLY
0
Entering edit mode

Still i didn't get any solution for this, which is why i came up with a new thread.

When I look at that thread, I see at least three solutions. Maybe they are not to your exact liking, but it isn't true that you didn't get any solutions.

ADD REPLY
0
Entering edit mode

This type of task is a good learning opportunity. It could be done in about 5 or 6 lines of python for instance.

What have you attempted so far?

ADD REPLY
0
Entering edit mode

If fasta headers don't have x, this should work with seqkit:

seqkit fx2tab test.fa | awk -F'X' 'NF<=2'  | seqkit tab2fx                                                                   
>Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF

$cat test.fa                                                                                                                  
>Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot2
ANSTVKKKKLLLYYYSSSEERXXXXXXXXXXXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot4
ANSTVKKKKLLLYYYSSSEEXFGHYFGHYFGHFYVXXFGFYVHCEDYHF
ADD REPLY
3
Entering edit mode
4.6 years ago

awk -v RS='>' '!/XX/{printf $0RT}' test.fasta

Output

Prot1

ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF

Prot3

ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF

ADD COMMENT
1
Entering edit mode

This doesn't work if sequence has scattered "X". OP requirement is "I am keeping filter criteria as if a read contains morethan 2 X ( continuesly or anywhere in the read) should be removed".

For eg.

output:

 $ awk -v RS='>' '!/XX/{printf $0RT}' file.fa                                                                              

>seq1
ATWDXGCX
>seq3
ATWDXGCC

with seqkit and awk:

$ seqkit fx2tab file.fa | awk -F'X' 'NF<=2'  | seqkit tab2fx                                                                   
>seq3
ATWDXGCC

Input:

$ cat file.fa                                                                                                                  
>seq1
ATWDXGCX
>seq2
ATWXXGCX
>seq3
ATWDXGCC

awk --version GNU Awk 5.0.1,

ADD REPLY
1
Entering edit mode
4.6 years ago
 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{gsub("X","",$2);print}' |\
sort -t $'\t' -k2,2 | uniq -f 1 -u | tr "\t" "\n"

>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
ADD COMMENT
0
Entering edit mode

Hi Mr. Pierre Lindenbaum,

Thanks for the fast reply, your awk script is removing fine reads( with no X) also. can you help again. and also i made few changes on my question, can you help with this modified thread ?

ADD REPLY

Login before adding your answer.

Traffic: 2241 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6