Question

Match peptides to protein sequences and report their positions in R

10.1 years ago

Marie • 0

Hello,

I was wondering if there is a convenient way to determine the sequence coverage of a given protein from a list of peptides.

For example I have this protein:

>sp|O00330|ODPX_HUMAN Pyruvate dehydrogenase protein X component, mitochondrial OS=Homo sapiens GN=PDHX PE=1 SV=3
MAASWRLGCDPRLLRYLVGFPGRRSVGLVKGALGWSVSRGANWRWFHSTQWLRGDPIKIL
MPSLSPTMEEGNIVKWLKKEGEAVSAGDALCEIETDKAVVTLDASDDGILAKIVVEEGSK
NIRLGSLIGLIVEEGEDWKHVEIPKDVGPPPPVSKPSEPRPSPEPQISIPVKKEHIPGTL
RFRLSPAARNILEKHSLDASQGTATGPRGIFTKEDALKLVQLKQTGKITESRPTPAPTAT
PTAPSPLQATAGPSYPRPVIPPVSTPGQPNAVGTFTEIPASNIRRVIAKRLTESKSTVPH
AYATADCDLGAVLKVRQDLVKDDIKVSVNDFIIKAAAVTLKQMPDVNVSWDGEGPKQLPF
IDISVAVATDKGLLTPIIKDAAAKGIQEIADSVKALSKKARDGKLLPEEYQGGSFSISNL
GMFGIDEFTAVINPPQACILAVGRFRPVLKLTEDEEGNAKLQQRQLITVTMSSDSRVVDD
ELATRFLKSFKANLENPIRLA

And the following peptides:

HSLDASQGTATGPR
STVPHAYATADCDLGAVLK
VVDDELATR

Can R report the start and end position of each peptide within the protein of interest and write them to a new file?

Is it also possible to get the FASTA sequence directly from UniProt?

I need to do this for many proteins and peptides, so I can't do it manually.

Thanks a lot!

R sequence • 6.0k views

Updated 5.8 years ago by Ram • written 10.1 years ago by Marie

Hello,

I have 30,000 peptide sequences from a human brain sample. I want to compare these peptides with the peptide sequences in the PRIDE database to see whether my peptides are novel or not. I am facing two problems.

First, I have to download the PRIDE dataset to a Linux server, because my own computer cannot handle downloads of this size. Second, how can I compare the peptides in R? Please give me some suggestions.

Shanzida.

Reply • 5.8 years ago by jahanshanzida

Ram · Answer 1 · 2015-03-18

The Biostrings package has many facilities for pattern/string matching. The Multiple Alignments and Pairwise Sequence Alignments vignettes would be useful to read through as a starting point.
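For exact peptide matches like the ones in the question, the idea can be sketched in base R without any extra packages. This is only a minimal sketch, not the Biostrings approach itself (there, `matchPattern()` on an `AAString` reports the same start and end positions). The helper name `locate_peptides`, the output file name, and the choice to report only the first exact occurrence of each peptide are assumptions made for illustration.

```r
# Protein sequence from the question (UniProt O00330), joined into one string.
protein <- paste0(
  "MAASWRLGCDPRLLRYLVGFPGRRSVGLVKGALGWSVSRGANWRWFHSTQWLRGDPIKIL",
  "MPSLSPTMEEGNIVKWLKKEGEAVSAGDALCEIETDKAVVTLDASDDGILAKIVVEEGSK",
  "NIRLGSLIGLIVEEGEDWKHVEIPKDVGPPPPVSKPSEPRPSPEPQISIPVKKEHIPGTL",
  "RFRLSPAARNILEKHSLDASQGTATGPRGIFTKEDALKLVQLKQTGKITESRPTPAPTAT",
  "PTAPSPLQATAGPSYPRPVIPPVSTPGQPNAVGTFTEIPASNIRRVIAKRLTESKSTVPH",
  "AYATADCDLGAVLKVRQDLVKDDIKVSVNDFIIKAAAVTLKQMPDVNVSWDGEGPKQLPF",
  "IDISVAVATDKGLLTPIIKDAAAKGIQEIADSVKALSKKARDGKLLPEEYQGGSFSISNL",
  "GMFGIDEFTAVINPPQACILAVGRFRPVLKLTEDEEGNAKLQQRQLITVTMSSDSRVVDD",
  "ELATRFLKSFKANLENPIRLA"
)
peptides <- c("HSLDASQGTATGPR", "STVPHAYATADCDLGAVLK", "VVDDELATR")

# Hypothetical helper: first exact occurrence of each peptide (NA if absent).
locate_peptides <- function(peptides, protein) {
  start <- vapply(peptides, function(p) {
    pos <- as.integer(regexpr(p, protein, fixed = TRUE))[1]
    if (pos == -1L) NA_integer_ else pos
  }, integer(1), USE.NAMES = FALSE)
  data.frame(peptide = peptides,
             start = start,
             end = start + nchar(peptides) - 1L,
             stringsAsFactors = FALSE)
}

hits <- locate_peptides(peptides, protein)

# Sequence coverage: mark every residue that lies inside a matched peptide.
covered <- rep(FALSE, nchar(protein))
for (i in which(!is.na(hits$start))) {
  covered[hits$start[i]:hits$end[i]] <- TRUE
}
coverage_pct <- 100 * mean(covered)

# Write the positions to a new file (a temporary directory is used here).
out_file <- file.path(tempdir(), "peptide_positions.csv")
write.csv(hits, out_file, row.names = FALSE)
print(hits)
```

For the three peptides in the question this reports positions 195-208, 296-314 and 477-485 in the 501-residue protein. For many proteins, the same helper could be applied in a loop or with `lapply()` over a list of sequences.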

As for getting sequences straight from UniProt, the answer to that is also yes. The UniProt website has a section on accessing its resources programmatically, which you should read through.

There are many ways to interact with a web service in R. I recently used UniProt's web service to do ID cross-referencing, and used the httr package to do so.

Since you want to fetch FASTA/peptide sequences, your queries won't look like this, but the snippet below, which maps a series of Entrez Gene IDs to UniProt accession numbers, should help to get you started:

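library(httr)

# Entrez Gene IDs to map to UniProt accessions
entrez <- c('51692', '1478', '26986')

# Query parameters for the ID-mapping service
params <- list(
  from   = 'P_ENTREZGENEID',  # source ID type: Entrez Gene ID
  to     = 'ACC',             # target ID type: UniProt accession
  format = 'tab',             # tab-separated output
  query  = paste(entrez, collapse = ' '))

# Legacy UniProt ID-mapping endpoint (since replaced by rest.uniprot.org/idmapping)
response <- POST('http://www.uniprot.org/mapping/',
                 body = params,
                 encode = 'multipart')

# Parse the tab-separated response text into a data frame
result <- read.table(textConnection(content(response, 'text')),
                     stringsAsFactors = FALSE,
                     header = TRUE)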

and result is

   From         To
1 51692     G5E9W3
2 51692     Q9UKF6
3  1478     P33240
4 26986 A0A024R9C1
5 26986     P11940
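For the FASTA part of the question, a request for a single entry could look roughly like the sketch below. It uses only base R, and the URL pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta` is an assumption based on UniProt's current REST service rather than something from the snippet above. The helper names `uniprot_fasta_url`, `fasta_to_sequence` and `fetch_uniprot_sequence` are hypothetical.

```r
# Build the FASTA URL for one UniProt accession.
# Assumption: URL pattern of the current UniProt REST service.
uniprot_fasta_url <- function(accession) {
  sprintf("https://rest.uniprot.org/uniprotkb/%s.fasta", accession)
}

# Collapse FASTA lines (header plus wrapped sequence) into one sequence string.
fasta_to_sequence <- function(lines) {
  paste(lines[!startsWith(lines, ">") & nzchar(lines)], collapse = "")
}

# Hypothetical helper: download and parse one entry (needs network access).
fetch_uniprot_sequence <- function(accession) {
  fasta_to_sequence(readLines(uniprot_fasta_url(accession), warn = FALSE))
}

# Example call, commented out because it needs network access:
# protein <- fetch_uniprot_sequence("O00330")
# nchar(protein)
```

Keeping URL construction and FASTA parsing in separate functions means the parsing step can also be reused on FASTA files that have already been downloaded.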