I am working on automatically downloading proteins to calculate their mass and the mass of their different subregions. Is there a tool that can help with this, or will I have to program it from scratch?
As output I can get a FASTA file from NCBI or a GenBank flat file (as well as other formats). The FASTA file contains no information about the regions. The relevant part of the GenBank file looks like this:
** ##Evidence-Data-END##
FEATURES Location/Qualifiers
source 1..230
/organism="Mus musculus"
/strain="NOD"
/db_xref="taxon:10090"
/chromosome="18"
/map="18"
Protein 1..230
/product="endothelial cell-specific chemotaxis regulator"
/note="endothelial cell-specific molecule 2; apoptosis
regulator through modulating IAP expression"
/calculated_mol_wt=24341
Region 134..228
/region_name="ECSCR"
/note="Endothelial cell-specific chemotaxis regulator;
pfam15820"
/db_xref="CDD:292448"
CDS 1..230
/gene="Ecscr"
/gene_synonym="1110006O17Rik; ARIA"
/coded_by="NM_001033141.1:82..774"
/db_xref="CCDS:CCDS37763.1"
/db_xref="GeneID:68545"
/db_xref="MGI:MGI:1915795"
ORIGIN
1 mlrdisleah glgstltpll ahqlpqgrvr gyssqptttq tsqeilqkss qvslvsnqpv
61 tprsstmdkq slslpdlmsf qpqkhtlgpg tgtperssss ssssssrrge asldatpspe
121 ttslqtkkmt illtilptpt sesvltvaaf gvisfivilv vvviilvsvv slrfkcrknk
181 esedpqkpgs sglsescsta ngekdsitli smrninvnns kgsmsaekil
//
**
So in theory I can extract the region from this file with some text mining and then parse the FASTA file. Since that would take some time, I figured I would post and see if anyone has a better solution.
If you don't have to work with these files, you could use Ensembl's API to extract this kind of information from the database. I think the protein molecular weight is available. You can also compute the mass of any peptide as the sum of the masses of its amino-acid residues plus one water molecule. There are also plenty of online tools for this.
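To make the "sum of residue masses plus water" idea concrete, here is a minimal Python sketch. The function name `average_mass` is made up for illustration, and the table holds commonly quoted average residue masses in Da; treat the values as approximate and check them against your own reference table before comparing with MALDI data. It ignores modifications, signal-peptide cleavage and non-standard residues.

```python
# Approximate average residue masses in Da (assumed values; verify against
# your own reference table before relying on them).
AVERAGE_RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.01524  # one water for the free N- and C-termini


def average_mass(sequence):
    """Return the average mass (Da) of an unmodified peptide sequence."""
    total = WATER_MASS
    for residue in sequence.upper():
        if residue not in AVERAGE_RESIDUE_MASS:
            # Fail loudly on X, B, Z, gaps, digits and so on.
            raise ValueError(f"unsupported residue: {residue!r}")
        total += AVERAGE_RESIDUE_MASS[residue]
    return total


if __name__ == "__main__":
    # First ten residues of the record shown in the question.
    print(round(average_mass("mlrdisleah"), 2))
```

The same function works for a whole protein or for any subregion, as long as you pass it the corresponding slice of the sequence.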
I do not have to work with these files, no. The key point is automation only. I just need to be able to feed in a list of protein names and receive the MW of the whole protein and all its subregions. It doesn't matter what I use to achieve that, since it will just be used to compare against a MALDI output. I will take a look at that API, thanks! Also, I do already use an MW compute tool in R. The problem is getting the mass of the subregions!
For MW calculations in an automated setting, I can recommend the EMBOSS suite of tools.
You can extract the sequences of the regions and compute the masses yourself using a table of masses of amino-acid residues or using the mw() function of the R package Peptides.
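If it helps, the region extraction could look roughly like the Python sketch below. It is only an illustration written against the record shown in the question: `extract_regions` is a made-up name, it uses plain regular expressions rather than a dedicated GenBank parser, and it assumes simple `start..end` locations (no `join(...)`, no `<` or `>` partial markers) and a single record per input string.

```python
import re

# A feature line such as "Region 134..228" (simple locations only).
FEATURE_LINE = re.compile(r"^(\w+)\s+(\d+)\.\.(\d+)$")
REGION_NAME = re.compile(r'^/region_name="([^"]*)"')


def extract_regions(record_text):
    """Return (sequence, [(name, start, end, subsequence), ...])."""
    # The sequence sits between the ORIGIN line and the closing "//".
    origin = re.search(r"^ORIGIN\b[^\n]*\n(.*?)^//", record_text,
                       flags=re.M | re.S)
    if origin is None:
        raise ValueError("no ORIGIN ... // block found")
    # Drop the position numbers and spaces, keep the letters.
    sequence = re.sub(r"[^A-Za-z]", "", origin.group(1)).upper()

    regions = []
    in_region = False
    for raw in record_text.splitlines():
        line = raw.strip()
        feature = FEATURE_LINE.match(line)
        if feature:
            key = feature.group(1)
            start, end = int(feature.group(2)), int(feature.group(3))
            in_region = key == "Region"
            if in_region:
                # GenBank coordinates are 1-based and inclusive.
                regions.append([None, start, end, sequence[start - 1:end]])
            continue
        name = REGION_NAME.match(line)
        if in_region and name:
            regions[-1][0] = name.group(1)
    return sequence, [tuple(region) for region in regions]
```

Each returned subsequence can then be passed to whatever mass calculation you already use for the full-length protein, for example a residue-mass table or mw() from Peptides.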
That's what I have been doing for the whole protein sequence. Indeed, it should not be too hard to extract the region based on the GenBank file. I was just wondering if there was some automated way to extract or identify the regions of the protein. I guess I will do it myself. Thanks!