Entering edit mode
5.9 years ago
Jason
▴
10
Hey All
I need help with shell command or Perl script.
I have 400 sequences and some of these sequences have X,B and Z.
I want to remove an entire sequence from fasta file that has X, B,Z
I found this shell command that will remove only this letter from the sequence
sed '/^[^>]/s/[X||Z|B]//g' input_file.fasta > output_file.fasta
But my goal to remove any sequence include these letters.
All my sequences in one line
for example:
>sp|Q9M7X9|CITRX_ARATH Thioredoxin-like protein CITRX, chloroplastic OS=Arabidopsis thaliana OX=3702 GN=CITRX PE=1 SV=1
MALVQSRTFPHLNTPLSPILSSLHAPSSLFIXREIRPVAAPXXSSTAGNLPFSPLTRPRKLLCPPPRGKFVREDYLVKKLSAQELQELVKGDRKVPLIVDFYATWCGPCILMAQELEMLAVEYESNAIIVKVDTDDEYEFARDMQVRGLPTLFFISPDPSKDAIRTEGLIPLQMMHDIIDNEM
>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV
>sp|Q99MD6|TRXR3_MOUSE Thioredoxin reductase 3 OS=Mus musculus OX=10090 GN=Txnrd3 PE=1 SV=3
MEKPPSPPPPPRAQTSPGLGKVGVLPNRRLGAVRGGLMSBBRRARLASPGTSRPSSEAREELRRRLRDLIEGNRVMIFSKSYCPHSTRVKELFSSLGVVYNILELDQVDDGASVQEVLTEISNQKTVPNIFV
the result will remove entire sequences include X, B, and Z
>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV
Will you remove sequences containing
J
?