How to extract only sequences that contain a specific percentage of Cysteines (C)
3
0
Entering edit mode
5.8 years ago
kamel ▴ 70

Hello Biostar
I have a multifasta file that contains protein sequences (of different sizes), I want to extract only those sequences that contain at least 5% of cysteines (C).
Can you give me a little script so that I can extract these sequences
Thank you in advance

proteins sequence • 2.4k views
ADD COMMENT
1
Entering edit mode

See this updated post:

https://web.archive.org/web/20071027112709/http://python.genedrift.org/2007/10/10/alternative-methods-to-split-a-fasta-file/

Let's use Python:

make your sequences looks like 2 strings each:

> header
1-line sequence

def c_part_in_string(s): 
    countC = 0

    for i in range (0, len(s)):
        if s[i] == "C":
            countC += 1
    return countC/len(s)

    for line in my_list:
        if line.startswith('>'):
            print(line)
        else:
            if c_part_in_string(line) > 0.05: 
                print(line)

This is just an idea.

ADD REPLY
0
Entering edit mode

Sorry, we cannot "give you a little script". Please tell us what you've tried so far and where you're facing difficulties, and we can help you with specifics.

ADD REPLY
0
Entering edit mode

I can quantify the number of C (cystein) by grep but in each sequence (ID). I have several protein sequences in a single multifasta file, I am looking for an idea to quantify the number of C in each sequence at a time, for extract the sequences which contains 5% cysteine.

ADD REPLY
2
Entering edit mode
5.8 years ago
JC 13k

Definitively you need some programming skills to do it by yourself. Here is some Perl code to do this:

#!/usr/bin/perl

use strict;
use warnings;
my $minC = 5; # Filter sequence below this percentage

$/ = "\n>"; # Read Fasta sequences in blocks
while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_); # Separate sequence id from the sequence lines
    my $seq = uc (join ("", @seq)); # Linearize the sequence in a single string, also using upper-case letters
    my $len = length $seq; # Get sequence size
    my $numC = $seq =~ tr/C/C/; # Quick way to count chars in a string
    my $perC = 100 * $numC / $len; # Calc percentage of Cs
    if ($perC >= $minC) { # Filter
         print ">$_";
    }
}

Usage: perl filterByC.pl < Fasta_In > Fasta_out

ADD COMMENT
0
Entering edit mode

Thanks for the help @JC

ADD REPLY
2
Entering edit mode
5.8 years ago

With awk:

$ awk -v FS="\n" -v RS=">" '$0 { seq=$0; $1="";  c=gsub("C", "_", $0); l=gsub(/[A-Z_]/, ".", $0); if(c/l>0.05) printf ">"seq}' input.fa
  • We set > as the record separator and \n as a the field separator.
  • If our record is not empty we save the whole record to seq.
  • We remove the id because we don't want to count the C there.
  • Then we count the replacement made from C to _ to get the number of C.
  • To get the total number of amino acid we count the replacements of each letter and a _.
  • If the fraction is greater 0.05 we print out the original sequence.
ADD COMMENT
0
Entering edit mode

Thanks for the help @finswimmer this is very useful

ADD REPLY
0
Entering edit mode
5.8 years ago

with seqkit and awk:

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=10 {print ">"$1,$2}'
>a
atgccc
>c
ac
>d
c

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=100 {print ">"$1,$2}'
>d
c

input:

$cat test.txt 
>a
atgccc
>b
atc
>c
ac
>d
c
ADD COMMENT
0
Entering edit mode

Thanks for the help

ADD REPLY

Login before adding your answer.

Traffic: 2452 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6