Question

How to extract a region of protein or small domain '150-300' from a long sequences of protein from a protein length of 2000AA. through command.

0

Entering edit mode

9.3 years ago

sanjeet00001992 • 0

Hello all,

I have a large number of protein made up of 1000 to 2000 amino acids. Now in these I want to extract only a small region for example '150-300' from each protein. How can I do this through terminal? I need to extract different region from different protein what I have to do.

programming protein-sequence command • 3.1k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by sanjeet00001992 • 0

0

Entering edit mode

Well do you know which sequences are involved in your processing and which regions you want to splice out? If yes then:

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 0,100; @a = ()}; if( $1 =~ /1/){$l = 1;print $_}else{$l=0;}}; ' in_file

NOTE: code not tested!!

the idea is to locate a sequence: $1 =~ /my_seq_id/ and then extract a region starting at position 0 of length 100 (print splice @a, 0,100;) Please note that the above line of code can be extended to extract the same position within any sequence sharing some regex in its header.

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by mxs ▴ 530

0

Entering edit mode

Thank you sir, but if I need to extract different region from different protein what I have to do?

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by sanjeet00001992 • 0

0

Entering edit mode

Repeat the procedure:

Extract 122-222 from sequence xxcx:

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 121,100; @a = ()}; if( $1 =~ /xxcx/){$l = 1;print $_}else{$l=0;}}; ' in_file

Extract 22-522 from sequence yyy:

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 21,500; @a = ()}; if( $1 =~ /yyy/){$l = 1;print $_}else{$l=0;}}; ' in_file

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by mxs ▴ 530

0

Entering edit mode

Write a script like python or perl and run it from terminal, which will do your job!

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by Pallab Bhowmick ▴ 20

Ram · Answer 1 · 2015-07-30

1

Entering edit mode

9.3 years ago

Lesley Sitter ▴ 610

In python you could do something like this

import sys
file_open = open(sys.argv[1],'r').readlines()
file_out = open("outputfile.txt",'w')
for lines in file_open:
    if '>' in lines:
        file_out.write(lines)
    else:
        file_out.write(lines[149:300]+"\n")

You can then start this script by typing

python my_script_name.py my_input_file.fasta

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by Lesley Sitter ▴ 610

Ram · Answer 2 · 2015-07-30

Prepare a list with the regions you want to extract (fasta-header, start,end)

$ cat list.txt
>gi|72201761    10    20
>gi|14666145    11    40
>gi|83671954    13    50

Linearize the fasta, join both sorted files, extract the substring with awk

join -t '      ' -1 1 -2 1 <(awk '/>/ {printf("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  | sort -k1,1) <(sort -k1,1 list.tsv) | awk -F '     ' '{printf("%s(%s-%s)\n%s\n",$1,$3,$4,substr($2,int($3),int($4)-int($3)));}'
>gi|14666145(11-40)
CATTCATCATGATAACAGCACCCTAAATA
>gi|72201761(10-20)
GTGCCATTTA
>gi|83671954(13-50)
CCGGAATTCCCGGGATATCGTCGACCCACGCGTCCGC

Ram · Answer 3 · 2015-07-30

0

Entering edit mode

9.3 years ago

venu 7.1k

You can use this perl script. It also works if it is MSA file.

$ perl select_sites.pl  file_name.faa  <start_position>  <end_position>

or

grep -v '^>' file_name.faa | tr -d "\n" | cut -c 150-300

but this works well when there is one sequence per one file. However you can loop it over all the files.

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by venu 7.1k