How to extract a region of protein or small domain '150-300' from a long sequences of protein from a protein length of 2000AA. through command.
3
0
Entering edit mode
9.5 years ago

Hello all,

I have a large number of protein made up of 1000 to 2000 amino acids. Now in these I want to extract only a small region for example '150-300' from each protein. How can I do this through terminal? I need to extract different region from different protein what I have to do.

programming protein-sequence command • 3.2k views
ADD COMMENT
0
Entering edit mode

Well do you know which sequences are involved in your processing and which regions you want to splice out? If yes then:

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 0,100; @a = ()}; if( $1 =~ /1/){$l = 1;print $_}else{$l=0;}}; ' in_file

NOTE: code not tested!!

the idea is to locate a sequence: $1 =~ /my_seq_id/ and then extract a region starting at position 0 of length 100 (print splice @a, 0,100;) Please note that the above line of code can be extended to extract the same position within any sequence sharing some regex in its header.

ADD REPLY
0
Entering edit mode

Thank you sir, but if I need to extract different region from different protein what I have to do?

ADD REPLY
0
Entering edit mode

Repeat the procedure:

Extract 122-222 from sequence xxcx:

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 121,100; @a = ()}; if( $1 =~ /xxcx/){$l = 1;print $_}else{$l=0;}}; ' in_file

Extract 22-522 from sequence yyy:

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 21,500; @a = ()}; if( $1 =~ /yyy/){$l = 1;print $_}else{$l=0;}}; ' in_file
ADD REPLY
0
Entering edit mode

Write a script like python or perl and run it from terminal, which will do your job!

ADD REPLY
1
Entering edit mode
9.5 years ago
Lesley Sitter ▴ 610

In python you could do something like this

import sys
file_open = open(sys.argv[1],'r').readlines()
file_out = open("outputfile.txt",'w')
for lines in file_open:
    if '>' in lines:
        file_out.write(lines)
    else:
        file_out.write(lines[149:300]+"\n")

You can then start this script by typing

python my_script_name.py my_input_file.fasta
ADD COMMENT
1
Entering edit mode
9.5 years ago

Prepare a list with the regions you want to extract (fasta-header, start,end)

$ cat list.txt
>gi|72201761    10    20
>gi|14666145    11    40
>gi|83671954    13    50

Linearize the fasta, join both sorted files, extract the substring with awk

join -t '      ' -1 1 -2 1 <(awk '/>/ {printf("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  | sort -k1,1) <(sort -k1,1 list.tsv) | awk -F '     ' '{printf("%s(%s-%s)\n%s\n",$1,$3,$4,substr($2,int($3),int($4)-int($3)));}'
>gi|14666145(11-40)
CATTCATCATGATAACAGCACCCTAAATA
>gi|72201761(10-20)
GTGCCATTTA
>gi|83671954(13-50)
CCGGAATTCCCGGGATATCGTCGACCCACGCGTCCGC
ADD COMMENT
0
Entering edit mode
9.5 years ago
venu 7.1k

You can use this perl script. It also works if it is MSA file.

$ perl select_sites.pl  file_name.faa  <start_position>  <end_position>

or

grep -v '^>' file_name.faa | tr -d "\n" | cut -c 150-300

but this works well when there is one sequence per one file. However you can loop it over all the files.

ADD COMMENT

Login before adding your answer.

Traffic: 3116 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6