Perl code to get the start and end positions of a list of peptides within their corresponding protein sequences
9.9 years ago
genie66 ▴ 30

I have a list of protein names with a list of corresponding peptides. I want to find the start and end position of each peptide within its corresponding protein sequence. Since the number of peptides is very large, it is not possible to do this manually. How can I do this programmatically? Please help me out.

Example:

Protein        peptide
A1AT_HUMAN     LSITGTYDLK
A1AT_HUMAN     SVLGQLGITK
A1BG_HUMAN     NGVAQEPVHLDSPAIK
A1BG_HUMAN     SGLSTGWTQLSK
A2GL_HUMAN     DLLLPQPDLR
A2GL_HUMAN     VAAGAFQGLR
A4_HUMAN       LVFFAEDVGSNK
A4_HUMAN       THPHFVIPYR
A4_HUMAN       WYFDVTEGK
perl protein-sequence peptide-position • 9.1k views

It's published data. These peptides were used in a mass spec analysis, and I need the positions of the peptides within the corresponding proteins.

9.9 years ago
Ram 44k

This would be my approach:

  1. Read the file with IDs and peptides (as shown in your question) into a hash, with the protein ID as the key and an array of the peptide sequences as the value. This is because you have multiple peptides to search for per ID.
  2. Iterate through the FASTA file sequence by sequence.
  3. For each sequence, run a Perl regex match for each of its peptides and print the match location (details on this here).

That should do the trick
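
The three steps above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the sub names (read_peptide_table, read_fasta, locate_peptides) are made up for illustration, the peptide table is assumed to be whitespace-separated, the FASTA ID is assumed to be the first word of the header and to match the ID in the table (real UniProt headers look like sp|accession|NAME and would need extra parsing), and the toy sequences are invented. It reports only the first match of each peptide.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash of arrays, protein ID => [peptides].
# Assumes whitespace-separated "ID peptide" lines, as in the question.
sub read_peptide_table {
    my ($text) = @_;
    my %peptides_for;
    for my $line (split /\n/, $text) {
        my ($id, $peptide) = split ' ', $line;
        next unless defined $peptide;
        push @{ $peptides_for{$id} }, $peptide;
    }
    return \%peptides_for;
}

# Hypothetical helper: parse FASTA text into ID => sequence.
# Assumes the ID is the first word of the header line.
sub read_fasta {
    my ($text) = @_;
    my (%seq_for, $id);
    for my $line (split /\n/, $text) {
        if ($line =~ /^>(\S+)/) {
            $id = $1;
            $seq_for{$id} = '';
        }
        elsif (defined $id) {
            $line =~ s/\s+//g;
            $seq_for{$id} .= $line;
        }
    }
    return \%seq_for;
}

# Return [id, peptide, start, end] (1-based, inclusive) for the first
# match of each peptide in its own protein sequence.
sub locate_peptides {
    my ($peptides_for, $seq_for) = @_;
    my @hits;
    for my $id (sort keys %$seq_for) {
        for my $peptide (@{ $peptides_for->{$id} || [] }) {
            next unless $seq_for->{$id} =~ /\Q$peptide\E/;
            # $-[0] is the 0-based start of the match,
            # $+[0] is the offset just past its last character
            push @hits, [ $id, $peptide, $-[0] + 1, $+[0] ];
        }
    }
    return @hits;
}

# Made-up toy data, not real UniProt sequences
my $table = "PROT_A PEP\nPROT_A TIDE\nPROT_B SEQ\n";
my $fasta = ">PROT_A\nMKPEPT\nIDEK\n>PROT_B\nAASEQAA\n";
print join("\t", @$_), "\n"
    for locate_peptides(read_peptide_table($table), read_fasta($fasta));
```

In real use the two strings would come from files instead of literals, and peptides that are not found are silently skipped here, which you may want to report instead.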

I am sorry. I am very new to programming. Can't this be done with a simple program?

Not really. I'd suggest a simple program, but it would be imperfect and not address the scenarios I see here, let alone contingencies for scenarios that might happen.

I'd suggest Python if you're new to programming, and you might have to go with string.index() method or use the technique shown here

Thanks for your suggestion. Can I use the same logic (which you mentioned for Perl) in Python to run through all peptide and protein sequences and find the positions?

Yes, the logic should hold in Python.

9.9 years ago
Siva ★ 1.9k

Once you have the protein sequences from UniProt, you can use Perl's index(), which takes a string (the protein sequence, in your case) and a substring (the peptide sequence) and returns the 0-based start position of the first occurrence of the substring, or -1 if the substring is not found. Adding the length of the peptide to that start gives the end position (0-based and exclusive, which is the same number as the 1-based inclusive end).
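
To make the position arithmetic concrete, here is a small sketch. The sub name first_position and the toy sequence are made up for illustration; it converts the 0-based result of index() into 1-based, inclusive coordinates, which is what most protein annotation uses.

```perl
use strict;
use warnings;

# Hypothetical helper: 1-based inclusive (start, end) of the first
# occurrence of $peptide, or an empty list if it is absent.
sub first_position {
    my ($protein_seq, $peptide) = @_;
    my $start0 = index($protein_seq, $peptide);   # 0-based; -1 when not found
    return if $start0 < 0;
    return ($start0 + 1, $start0 + length($peptide));
}

my ($start, $end) = first_position('MKPEPTIDEK', 'TIDE');   # toy sequence
print "start=$start end=$end\n";   # prints "start=6 end=9"
```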

I hope you are taking into account that there may be more than one match for your peptide sequence in the protein sequence. In that case, you would first want to check how many times the peptide occurs in the protein sequence. For a general idea, check this sample code.
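
One way to count and locate every occurrence with index() is to keep calling it with an increasing offset until it returns -1. The sketch below assumes a made-up sub name (all_starts) and a toy sequence. Because it resumes one residue after each hit, it also reports overlapping occurrences.

```perl
use strict;
use warnings;

# Hypothetical helper: 1-based start positions of every occurrence of
# $peptide in $protein_seq, overlapping ones included.
sub all_starts {
    my ($protein_seq, $peptide) = @_;
    my @starts;
    return @starts if $peptide eq '';   # an empty peptide would loop forever
    my $pos = index($protein_seq, $peptide);
    while ($pos >= 0) {
        push @starts, $pos + 1;                           # report 1-based
        $pos = index($protein_seq, $peptide, $pos + 1);   # resume one residue later
    }
    return @starts;
}

my @starts = all_starts('AKAKAK', 'AKAK');   # toy sequence
print scalar(@starts), " occurrence(s) at: @starts\n";
# prints "2 occurrence(s) at: 1 3"
```

The end of each hit is the start plus the peptide length minus one.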

Wouldn't a RegEx match with @+ be easier?
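
For anyone who has not met these variables: after a successful match, $-[0] holds the 0-based offset where the match starts and $+[0] holds the offset just past its end. A rough sketch of the regex route follows, with a made-up sub name (regex_positions) and a toy sequence. Note that a plain /g scan continues after the end of the previous match, so overlapping occurrences are not reported.

```perl
use strict;
use warnings;

# Hypothetical helper: [start, end] pairs (1-based, inclusive) for each
# non-overlapping match of $peptide in $protein_seq.
sub regex_positions {
    my ($protein_seq, $peptide) = @_;
    my @hits;
    return @hits if $peptide eq '';   # an empty pattern has special meaning in Perl
    # \Q...\E quotes the peptide so that it is matched literally
    while ($protein_seq =~ /\Q$peptide\E/g) {
        push @hits, [ $-[0] + 1, $+[0] ];
    }
    return @hits;
}

# toy sequence
print "start=$_->[0] end=$_->[1]\n" for regex_positions('MKPEPTIDEKPEP', 'PEP');
# prints "start=3 end=5" and then "start=11 end=13"
```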

Wow. I did not know about these built-in variables. Thank you. Though I might not recommend that someone unfamiliar with Perl work with built-in variables, with all those confusing "a-cat-walked-across-my-keyboard" symbols :)

index() will be faster, so I wouldn't even worry about these variables. This task is exactly what index() was designed for, but of course, a regex solution can be applied to anything.

I agree. But it's not just the index() - it's the iteration to find all occurrences. That is where RegEx makes it easier.

Both approaches require a while() loop to find all matches, so there is no difference there. You need one more line to get the end position with index(), though I wouldn't say that regex is easier because of that. I would probably use index() because it is the more idiomatic choice, being designed specifically for this purpose, and it is also a bit more readable. I've also used regex for this purpose depending on the task. Outside of the performance between the two approaches, and there may be no difference, it probably doesn't really matter which method is used. Both are good choices and will be fast enough.

Thanks for the explanation.

9.9 years ago
Michael 55k

Did you try blastp with -task blastp-short against all human proteins?

Are the peptides exact matches (if so, how do you know)?

9.9 years ago
Prasad ★ 1.6k

This might not be the best or easiest method, but it could serve the purpose. Based on the example, the data is from UniProt. You just map the first column in UniProt and get the sequences, then write a simple Perl script for string matching (each peptide against the sequence you have obtained).

Could you please help me map the protein names in UniProt to get the sequences of all the proteins? How can I do that?

If I understand correctly, you want to download the protein sequences for a list of UniProt protein names? If so, have a look at this thread: How To Find Sequences For Protein Names (A Challenge)

In UniProt there is a batch submission. Enter the first column (protein names), then download just the sequence file. (Although UniProt removes duplicates, I would suggest removing them yourself if you have a really huge list.) The code below might be useful:

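#!/usr/bin/perl
# Usage: perl script.pl proteins.fasta peptides.tsv
use strict;
use warnings;

my ($fasta_file, $peptide_file) = @ARGV;
my %seq_for;    # entry name (e.g. A1AT_HUMAN) => protein sequence

# Read the FASTA file one record at a time, using ">" as the record separator.
{
    local $/ = ">";
    open my $fasta, '<', $fasta_file or die "$fasta_file - $!\n";
    while (<$fasta>) {
        chomp;
        my ($header, $seq) = /(.*?)\n(.*)/s or next;
        $seq =~ s/[\s\d\W]//g;    # remove digits, spaces, line breaks, ...
        # The first word of the header is like "sp|P01009|A1AT_HUMAN";
        # keep the trailing entry name as the hash key.
        my ($first_word) = split ' ', $header;
        next unless defined $first_word && $first_word =~ /(\w+)$/;
        $seq_for{$1} = $seq;
    }
    close $fasta;
}

# Peptide file: tab-separated, protein name in column 1, peptide in column 2.
open my $peptides, '<', $peptide_file or die "$peptide_file - $!\n";
while (<$peptides>) {
    chomp;
    my ($protein, $peptide) = split /\t/;
    next unless defined $peptide;
    s/\s+//g for $protein, $peptide;
    next unless exists $seq_for{$protein};    # also skips the header line
    my $start = index($seq_for{$protein}, $peptide);    # 0-based, -1 if absent
    next if $start < 0;
    # Report 1-based, inclusive start and end positions.
    print join("\t", $protein, $peptide, $start + 1, $start + length($peptide)), "\n";
}
close $peptides;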

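This gives the first occurrence of each peptide. If you want all occurrences, call index() repeatedly in a loop, passing the previous start plus one as the offset (the third argument) until it returns -1.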

Always use Bio-packages. Much less hacky than using > as the record separator. Also, this code doesn't deal with the fact that each ID has multiple sequences and each sequence from the FASTA file might have multiple matches. Though you mention changing the offset, that would involve nesting the code in a loop. Handling multiple matches is much easier using the regex match arrays.
