Question

Translating bulk mutation .bed or .vcf data into aa protein fasta sequences

9.6 years ago

Julian J ▴ 10

To whom it may concern,

I am trying to convert dbSNP data into predicted protein variants in FASTA format, in this case for the human gene FUS (ENSG00000089280).

In my example, I retrieved the rs numbers from NCBI dbSNP with the query

(FUS[Gene Name]) AND pathogenic[Clinical Significance]

and exported the results to a .bed and/or .vcf file.

My aim is to generate predicted protein sequences (e.g. the RefSeq sequence carrying the corresponding mutation) from human indels or single point mutations. I have tried to get to grips with customProDB, but my R skills are limited so far.

Example .bed file:

track name=dbSNP_human description="dbSNP Build 142 ()" date="2015-04-19 10:00" taxId=9606 dbSnpBuild=142 URL="http://www.ncbi.nlm.nih.gov/snp" assembly= assemblyAccession=
chr16    31191407    31191408    rs121909667    0    +
chr16    31191417    31191418    rs121909668    0    +
chr16    31191409    31191410    rs121909669    0    +
chr16    31191418    31191419    rs121909671    0    +
chr16    31190397    31190398    rs186547381    0    +
chr16    31191088    31191089    rs267606831    0    +
chr16    31185060    31185061    rs267606832    0    +
chr16    31191426    31191427    rs267606833    0    +
chr16    31191051    31191052    rs387906627    0    +
chr16    31185030    31185031    rs387906628    0    +
chr16    31189157    31189158    rs387907274    0    +

I would be very thankful if someone could help me generate protein sequences from DNA .bed files (or similar).

Please shout if I have forgotten to mention anything important, or if my question needs to be moved to another section of the forum.

Many thanks,
Julian

R SNP sequence • 2.8k views

Ram · Answer 1 · 2015-04-22

9.6 years ago

kautilya ▴ 430

You could try the following approach:

Download the homo_sapiens_variation.txt.gz file from UniProt. This file links a dbSNP ID to the UniProt ID of the protein it affects, and also gives the position and the amino acid change.
Filter this file down to the rsIDs relevant to you, using R or any other scripting language.
Download the relevant UniProt proteins using the UniProt REST API (e.g., for P04217: http://www.uniprot.org/uniprot/P04217.fasta), then use the information from the file above in a script that replaces the affected amino acid with the mutated one.
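
The last step (replacing the affected residue in batch) could be sketched in Python roughly as below. This is a minimal sketch under assumptions, not a tested pipeline: it assumes you have already pulled out, for each rsID, the 1-based protein position plus the reference and mutated amino acids from the variation file, and that the protein FASTA has already been downloaded. The function names, the toy sequence and the `rs_example_*` IDs are made up for illustration. It only covers single amino acid substitutions; frameshifts and other indels would need translation from the transcript instead.

```python
def read_fasta_sequence(fasta_text):
    """Return (header, sequence) from the text of a single-record FASTA file."""
    lines = [ln.strip() for ln in fasta_text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like a FASTA record")
    return lines[0][1:], "".join(lines[1:])


def apply_substitution(sequence, position, ref_aa, alt_aa):
    """Replace the residue at a 1-based position, checking the reference first."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != ref_aa:
        # A mismatch usually means a different isoform or numbering scheme.
        raise ValueError(
            f"expected {ref_aa} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + alt_aa + sequence[index + 1:]


def write_variant_fasta(header, sequence, variants, width=60):
    """Build multi-record FASTA text with one mutated sequence per variant.

    `variants` is a list of (rsid, position, ref_aa, alt_aa) tuples.
    """
    records = []
    for rsid, position, ref_aa, alt_aa in variants:
        mutated = apply_substitution(sequence, position, ref_aa, alt_aa)
        title = f">{header}|{rsid}|{ref_aa}{position}{alt_aa}"
        wrapped = [mutated[i:i + width] for i in range(0, len(mutated), width)]
        records.append("\n".join([title] + wrapped))
    return "\n".join(records) + "\n"


# Toy data only: an invented sequence and invented variants, not real FUS data.
toy_fasta = ">toy_protein\nMASNDYTQQA\nTQSYGAYPTQ\n"
toy_variants = [
    ("rs_example_1", 3, "S", "P"),
    ("rs_example_2", 20, "Q", "H"),
]

toy_header, toy_sequence = read_fasta_sequence(toy_fasta)
print(write_variant_fasta(toy_header, toy_sequence, toy_variants), end="")
```

Checking the reference residue before editing is deliberate: if the amino acid at the stated position does not match, the variant was probably annotated against a different isoform, and it is safer to stop than to write a wrong sequence.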

Hi, thanks for the answer.

The problem is automating the amino acid change so that the sequences are created in a batch. Unfortunately, I am not experienced in scripting and therefore don't know how to start.

Thanks, Julian

Reply • 9.6 years ago by Julian J ▴ 10