Fast sites removal from alignment
2
0
Entering edit mode
6.5 years ago
jan • 0

Hi all,

I have a file listing site-specific substitution rates (file.rate; please see below). In several steps (à 4,000 sites), I would like to remove the fastest evolving sites (shown in column "Rate") from my AA alignment (file.ali). I would be very happy and grateful if you could help out (maybe in python, perl, awk).

Site Rate Cat C_Rate

1 2.74582 4 2.74582

2 0.31646 2 0.31554

3 0.77656 3 0.88431

4 0.08958 1 0.05433

5 ... ...

The alignment file is a normal fasta (one-liner):

>Spec1
TDKCKPKKCHLECKKNCPIVKTGKSSKIAFISEMLCIGCGICVK
>Spec2
TDKCKPK----EKKKNCPIVGTGKSSKIAFISEMLCIGCGICVK
>Spec3
????KKCHLECKKNCPIVKTGK-----IAFISEMLCIGCGICVK
alignment sequence • 1.8k views
ADD COMMENT
0
Entering edit mode
6.5 years ago
Hugo ▴ 380

You can use awk in first place to obtain the list of sites. For instance, the following command prints only lines where "C_Rate" (fourth column, $4 in awk) is higher or equal than 1:

awk 'NR > 1 { if ($4 >= 1) print $0 }' sites.list

I would recommend you to get only the first column and append the "Spec" word and save it into a file:

awk 'NR > 1 { if ($4 >= 1) print "Spec"$1 }' sites.list > filtered.sites.list

Then you could go to SEDA www.sing-group.org/seda) load the filtered.sites.list in the "Pattern filtering" operation (http://www.sing-group.org/seda/manual/operations.html#pattern-filtering), and use it to filter your FASTA file.

ADD COMMENT
0
Entering edit mode

Thanks, but how is this removing sites from the alignment?

ADD REPLY
0
Entering edit mode

Ok, I understand now. How is your AA file? I think that you should use this awk filter first and then filter the remaining sites. Please, show us your AA file to see how this can be achieved.

ADD REPLY
0
Entering edit mode

I have updated my answer, hope this can be useful to you.

ADD REPLY
0
Entering edit mode

No, this is very cumbersome as I would have to find all the different thresholds to use by myself etc. Anyway, thanks for your suggestions.

ADD REPLY
0
Entering edit mode
6.5 years ago

If you only want remove the ONE fastest site, firstly get the site:

$ cat rate.tsv | csvtk sort -t -k Rate:nr | csvtk cut -t -f Site | sed -n 2p 
1

Supposing it's 3rd not 1, next, retrieve sub-sequence on the left and right, and then re-concatenate them.

$ seqkit subseq -r 1:2 seqs.fa  > left.fa
$ seqkit subseq -r 4:-1 seqs.fa > right.fa

$ seqkit concat left.fa right.fa 
>Spec1
TDCKPKKCHLECKKNCPIVKTGKSSKIAFISEMLCIGCGICVK
>Spec2
TDCKPK----EKKKNCPIVGTGKSSKIAFISEMLCIGCGICVK
>Spec3
???KKCHLECKKNCPIVKTGK-----IAFISEMLCIGCGICVK

If you want remove multiple sites, it's a bit completed for shell command, you'd better write a script.

ADD COMMENT

Login before adding your answer.

Traffic: 1625 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6