Remove entire sequences that include specific letters
5.9 years ago
Jason ▴ 10

Hey all,

I need help with a shell command or a Perl script.

I have 400 sequences, and some of them contain X, B, or Z.

I want to remove every sequence that contains X, B, or Z from the FASTA file.

I found this shell command, but it only deletes those letters from the sequences:

sed '/^[^>]/s/[XZB]//g' input_file.fasta > output_file.fasta

But my goal is to remove any sequence that includes these letters.

Each of my sequences is on a single line.

For example:

>sp|Q9M7X9|CITRX_ARATH Thioredoxin-like protein CITRX, chloroplastic OS=Arabidopsis thaliana OX=3702 GN=CITRX PE=1 SV=1
MALVQSRTFPHLNTPLSPILSSLHAPSSLFIXREIRPVAAPXXSSTAGNLPFSPLTRPRKLLCPPPRGKFVREDYLVKKLSAQELQELVKGDRKVPLIVDFYATWCGPCILMAQELEMLAVEYESNAIIVKVDTDDEYEFARDMQVRGLPTLFFISPDPSKDAIRTEGLIPLQMMHDIIDNEM
>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV
>sp|Q99MD6|TRXR3_MOUSE Thioredoxin reductase 3 OS=Mus musculus OX=10090 GN=Txnrd3 PE=1 SV=3
MEKPPSPPPPPRAQTSPGLGKVGVLPNRRLGAVRGGLMSBBRRARLASPGTSRPSSEAREELRRRLRDLIEGNRVMIFSKSYCPHSTRVKELFSSLGVVYNILELDQVDDGASVQEVLTEISNQKTVPNIFV

The result should exclude every sequence that contains X, B, or Z:

>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3 
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3 
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV

Will you remove sequences containing J?

    A   Ala Alanine
    B   Asx Aspartic acid or asparagine [2]
    C   Cys Cysteine
    D   Asp Aspartic acid
    E   Glu Glutamic acid
    F   Phe Phenylalanine
    G   Gly Glycine
    H   His Histidine
    I   Ile Isoleucine
    J   Xle Isoleucine or leucine [4]
    K   Lys Lysine
    L   Leu Leucine
    M   Met Methionine
    N   Asn Asparagine
    O   Pyl Pyrrolysine [6]
    P   Pro Proline
    Q   Gln Glutamine
    R   Arg Arginine
    S   Ser Serine
    T   Thr Threonine
    U   Sec Selenocysteine [5,6]
    V   Val Valine
    W   Trp Tryptophan
    Y   Tyr Tyrosine
    Z   Glx Glutamine or glutamic acid [2]
    X   Xaa Unknown amino acid
    .       Gap
    *       End
References:
    1. http://www.bioinformatics.org/sms/iupac.html
    2. http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html
    3. http://www.bioinformatics.org/sms2/iupac.html
    4. http://www.matrixscience.com/blog/non-standard-amino-acid-residues.html
    5. http://www.sbcs.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA21
    6. https://en.wikipedia.org/wiki/Amino_acid
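
If you are unsure which of these codes actually occur in your file, a quick tally may help before you decide what to filter. This is only a rough sketch: `count_nonstandard` is a made-up helper name, and it assumes a plain FASTA file whose header lines start with `>`. It counts every character in the sequence lines that is not one of the 20 standard residues, so gap and stop symbols would be listed too.

```shell
# count_nonstandard is a hypothetical helper name, not part of any tool.
# Usage: count_nonstandard [FASTA...]   (reads stdin if no file is given)
count_nonstandard() {
    awk '
        /^>/ { next }                                  # ignore header lines
        {
            seq = toupper($0)
            gsub(/[ \t\r]/, "", seq)                   # drop stray whitespace
            gsub(/[ACDEFGHIKLMNPQRSTVWY]/, "", seq)    # drop the 20 standard residues
            n = length(seq)
            for (i = 1; i <= n; i++) count[substr(seq, i, 1)]++
        }
        END { for (c in count) printf "%s\t%d\n", c, count[c] }
    ' "$@" | sort
}
```

Something like `count_nonstandard input_file.fasta` would then print one line per non-standard letter with its number of occurrences, which should show whether J, U, or O are present at all.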

5.9 years ago
JC 13k

You don't need a substitution. Test each sequence for a match and skip the record if it matches:

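#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>"; # Read FASTA records in blocks, one record per iteration
while (<>) {
    s/^>//;  # strip the leading ">" of the first record
    s/>\z//; # strip the separator's ">" left at the end of the block
    my ($seq_id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    next if ($seq =~ m/[ZXB]/); # skip sequences with Z, X or B
    print ">$_";
}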

Usage: perl removeSeqs.pl < FASTA_IN > FASTA_OUT
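
If you would rather stay in the shell, roughly the same logic can be sketched with awk. `filter_fasta` is a hypothetical function name, and the letters to reject are passed as its first argument. The sketch buffers each record, so it should also cope with sequences wrapped over several lines. The match is case-sensitive.

```shell
# filter_fasta is a hypothetical helper name, not part of any tool.
# Usage: filter_fasta LETTERS [FASTA...]   (reads stdin if no file is given)
filter_fasta() {
    letters=$1
    shift
    awk -v bad="[$letters]" '
        # Print the buffered record unless one of its sequence lines matched.
        function emit() { if (header != "" && !drop) printf "%s\n%s", header, body }
        /^>/ { emit(); header = $0; body = ""; drop = 0; next }
        { body = body $0 "\n"; if ($0 ~ bad) drop = 1 }
        END { emit() }
    ' "$@"
}
```

For the file in the question this would be called as `filter_fasta XBZ input_file.fasta > output_file.fasta`. Header lines are never tested against the letters, so an X, B, or Z in a sequence ID does not cause a record to be dropped.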

5.9 years ago

Try seqkit grep (see its usage page).

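seqkit grep -i -s -r -p '[zxb]' -v input_file.fasta > output_file.fasta

# The same command with long options, reading from stdin:
# cat test.fa | seqkit grep --ignore-case --by-seq --use-regexp --pattern '[zxb]' --invert-match

# Or, without a regular expression, one pattern per letter:
# seqkit grep -i -s -p z -p x -p b -v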