2.0 years ago
kng
I am trying to extract peptide sequences for each exon for my gene using biomaRt. I managed to extract DNA sequences for each exon but struggled to extract amino acid sequences for each exon separately. Below is the code I used in R. Please advise!
library(biomaRt)
ensembl <- useMart("ensembl")
human_ensembl <- useDataset("hsapiens_gene_ensembl", ensembl)
ensemble_data <- getBM(attributes = c("ensembl_transcript_id", "gene_exon", "ensembl_exon_id",
                                      "exon_chrom_start", "exon_chrom_end", "rank", "strand", "peptide"),
                       filters = c("ensembl_transcript_id"),
                       values = "ENST00000380152",
                       mart = human_ensembl)
Hi! What are you trying to achieve, and how are you planning to generate the peptide sequences that fall on the junction between two exons of the same protein? I am not sure of your goal, but it might make more sense to extract the full protein sequence for ENST00000380152 and run a sliding window of your desired peptide length over it.
Hi iraun, thank you for your reply. If a sequence falls on the junction between two exons, it can be part of both exons. But I need the peptide sequences for each exon separately. If I remove "peptide" from the attribute list in the code above, I get the DNA sequences of the 27 exons of ENST00000380152, and each has a different length, so the sliding-window approach you suggest, applied to the full-length protein sequence, would not help. There must be a smart way to extract these using biomaRt or some other tool?
Maybe this helps a bit:
It does not, however, give you the sequence for each exon separately, only the combined sequence of those exons that are part of an actual translated transcript.
If you want to translate any exon no matter what, then Biostrings::translate() will probably be of use, provided you can keep the reading frame in sync with the help of the exon coordinates in cdsAnnot.
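One way to keep the reading frame in sync without translating each exon on its own is to cut the full-length peptide at the exon boundaries. The sketch below is only an illustration in base R: `split_peptide_by_exon` and `cds_len` are hypothetical names, and it assumes you already have the full peptide plus the number of coding bases each exon contributes (for example, derived from per-exon CDS coordinates). A codon that straddles a junction is assigned to the exon holding its first base and flagged.

```r
# Hypothetical helper (not part of biomaRt or Biostrings).
# peptide: full protein sequence as one string, without a trailing "*".
# cds_len: coding bases contributed by each exon, in transcript order
#          (0 for exons that are entirely UTR).
split_peptide_by_exon <- function(peptide, cds_len) {
  cds_end   <- cumsum(cds_len)          # last CDS base covered by each exon
  cds_start <- cds_end - cds_len + 1    # first CDS base covered by each exon
  # Codon k spans CDS bases 3k-2..3k; assign it to the exon holding base 3k-2.
  first_codon <- ceiling((cds_start + 2) / 3)
  # Cap at the peptide length so the stop codon is not counted as a residue.
  last_codon  <- pmin(floor((cds_end + 2) / 3), nchar(peptide))
  data.frame(
    rank        = seq_along(cds_len),
    # substring() returns "" when last_codon < first_codon (non-coding exon).
    peptide     = substring(peptide, first_codon, last_codon),
    # TRUE when the exon ends mid-codon, so its last residue is shared
    # with the next coding exon.
    split_codon = cds_len > 0 & cds_end %% 3 != 0,
    stringsAsFactors = FALSE
  )
}

# Toy input: 6 residues (18 bases) plus a 3-base stop codon = 21 coding bases.
toy <- split_peptide_by_exon("MKTAYI", cds_len = c(0, 4, 8, 9))
print(toy)
```

Assigning a junction codon to the upstream exon is a design choice made for this sketch; if you would rather count that residue in both exons, the `split_codon` column tells you where to duplicate it.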
Hi kng,
I would advise using the Ensembl REST API to get the exon sequence data: http://rest.ensembl.org/
You can use the Lookup endpoint with the 'expand' optional parameter to get the IDs of the exons (ENSE#) given a transcript ID (ENST#). Then use the Sequence endpoint to retrieve the protein sequences for each of the exons.
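The two-step lookup described above could be scripted roughly as follows. This is a sketch under assumptions, not a tested recipe: the helper names are made up for illustration, it assumes the `jsonlite` package is installed, that the expanded Lookup response lists exons under an `Exon` field with an `id` column, and it defaults to `type = "genomic"` because this sketch does not verify which sequence types the Sequence endpoint accepts for exon IDs.

```r
# Base URL of the Ensembl REST service mentioned in the answer.
rest_base <- "https://rest.ensembl.org"

# Hypothetical helper: URL for the Lookup endpoint with the 'expand' parameter.
lookup_url <- function(id, expand = TRUE) {
  sprintf("%s/lookup/id/%s?expand=%d;content-type=application/json",
          rest_base, id, as.integer(expand))
}

# Hypothetical helper: URL for the Sequence endpoint, returned as plain text.
sequence_url <- function(id, type = "genomic") {
  sprintf("%s/sequence/id/%s?type=%s;content-type=text/plain",
          rest_base, id, type)
}

# Hypothetical helper: fetch one sequence per exon of a transcript.
# Needs network access; nothing is fetched until this function is called.
fetch_exon_sequences <- function(transcript_id, type = "genomic") {
  if (!requireNamespace("jsonlite", quietly = TRUE)) {
    stop("The jsonlite package is required for this sketch.")
  }
  tx <- jsonlite::fromJSON(lookup_url(transcript_id))
  exon_ids <- tx$Exon$id   # assumption: exons are listed under 'Exon'
  vapply(exon_ids, function(id) {
    paste(readLines(sequence_url(id, type), warn = FALSE), collapse = "")
  }, character(1))
}

# Example call (commented out because it needs network access):
# exon_seqs <- fetch_exon_sequences("ENST00000380152")
```

Keeping the URL construction in separate functions makes it easy to inspect the exact request in a browser before scripting around it.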
Are there more options for customizing the response when using the API directly rather than BioMart?
Hi Matthias, the data returned by each REST API endpoint can be customised using optional parameters. The parameters available for each endpoint are listed on its documentation page, for example: http://rest.ensembl.org/documentation/info/lookup
When using the REST API, the idea is that you can write scripts (in any language) around the REST API endpoints to pull out specific bits of the output, process it in custom ways and feed it into other platforms.
You can find out more in our online course: https://www.ebi.ac.uk/training/online/courses/ensembl-rest-api/