Question

How to extract only sequences that contain a specific percentage of Cysteines (C)

0

Entering edit mode

6.4 years ago

kamel ▴ 70

Hello Biostar
I have a multifasta file that contains protein sequences (of different sizes), I want to extract only those sequences that contain at least 5% of cysteines (C).
Can you give me a little script so that I can extract these sequences
Thank you in advance

proteins sequence • 2.9k views

ADD COMMENT • link updated 6.4 years ago by cpad0112 21k • written 6.4 years ago by kamel ▴ 70

1

Entering edit mode

See this updated post:

https://web.archive.org/web/20071027112709/http://python.genedrift.org/2007/10/10/alternative-methods-to-split-a-fasta-file/

Let's use Python:

make your sequences looks like 2 strings each:

> header
1-line sequence

def c_part_in_string(s): 
    countC = 0

    for i in range (0, len(s)):
        if s[i] == "C":
            countC += 1
    return countC/len(s)

    for line in my_list:
        if line.startswith('>'):
            print(line)
        else:
            if c_part_in_string(line) > 0.05: 
                print(line)

This is just an idea.

ADD REPLY • link updated 6.4 years ago by Ram 45k • written 6.4 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

Sorry, we cannot "give you a little script". Please tell us what you've tried so far and where you're facing difficulties, and we can help you with specifics.

ADD REPLY • link 6.4 years ago by Ram 45k

0

Entering edit mode

I can quantify the number of C (cystein) by grep but in each sequence (ID). I have several protein sequences in a single multifasta file, I am looking for an idea to quantify the number of C in each sequence at a time, for extract the sequences which contains 5% cysteine.

ADD REPLY • link 6.4 years ago by kamel ▴ 70

0

Entering edit mode

6.4 years ago

cpad0112 21k

with seqkit and awk:

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=10 {print ">"$1,$2}'
>a
atgccc
>c
ac
>d
c

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=100 {print ">"$1,$2}'
>d
c

input:

$cat test.txt 
>a
atgccc
>b
atc
>c
ac
>d
c

ADD COMMENT • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

Thanks for the help

ADD REPLY • link 6.4 years ago by kamel ▴ 70

score 2 · Accepted Answer · 2019-02-15

Definitively you need some programming skills to do it by yourself. Here is some Perl code to do this:

#!/usr/bin/perl

use strict;
use warnings;
my $minC = 5; # Filter sequence below this percentage

$/ = "\n>"; # Read Fasta sequences in blocks
while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_); # Separate sequence id from the sequence lines
    my $seq = uc (join ("", @seq)); # Linearize the sequence in a single string, also using upper-case letters
    my $len = length $seq; # Get sequence size
    my $numC = $seq =~ tr/C/C/; # Quick way to count chars in a string
    my $perC = 100 * $numC / $len; # Calc percentage of Cs
    if ($perC >= $minC) { # Filter
         print ">$_";
    }
}

Usage: perl filterByC.pl < Fasta_In > Fasta_out

score 2 · Accepted Answer · 2019-02-16

2

Entering edit mode

6.4 years ago

finswimmer 16k

With awk:

$ awk -v FS="\n" -v RS=">" '$0 { seq=$0; $1="";  c=gsub("C", "_", $0); l=gsub(/[A-Z_]/, ".", $0); if(c/l>0.05) printf ">"seq}' input.fa

We set > as the record separator and \n as a the field separator.
If our record is not empty we save the whole record to seq.
We remove the id because we don't want to count the C there.
Then we count the replacement made from C to _ to get the number of C.
To get the total number of amino acid we count the replacements of each letter and a _.
If the fraction is greater 0.05 we print out the original sequence.