I have run a series of experiments using DEXSeq to identify differentially expressed exons in my RNA-seq data, and have produced a list of exons of interest:
> exons
feature chr start stop strand
1 ENSMUSG00000009927:E001 9 44407139 44407659 +
2 ENSMUSG00000022836:E034 16 35000273 35002420 +
3 ENSMUSG00000025745:E001 5 30118304 30119425 -
4 ENSMUSG00000028034:E001 3 152210422 152210472 +
5 ENSMUSG00000028656:E001 4 122859047 122860217 -
...
However, because DEXSeq requires a custom GFF file as input (prepared using the bundled script, dexseq_prepare_
annotation.py
), differentially expressed exons are not identified using Ensembl identifiers (e.g., ENSMUSE00001401988
) but rather using DEXSeq's internal numbering (e.g., E001
, E034
). Additionally, the exons output by DEXSeq do not always appear to correspond exactly to DEXSeq exons. Sometimes they represent only a portion of the full exon as annotated in Ensembl.
I am interested in analyzing some of the features of the protein sequences of these exons, e.g. InterPro domain content.
I therefore matched DEXSeq gene/exon combinations (in the column feature
in the example table above) to entries in the DEXSeq GFF file to identify the coordinates of each DEXSeq exon. However, I am having a very difficult time figuring out how to retrieve the protein sequences that correspond to these coordinates.
I tried making a TxDb
object (from the GenomicFeatures
package) to match DEXSeq GFF exons to the original Ensembl GTF exons, but this led to the discovery that DEXSeq exons do not always correspond exactly to complete Ensembl exons. I next tried making a GRanges
object to use getSeq
, but this appears to require a BSgenome
object and there are no BSgenome packages for Ensembl. Finally, I tried retrieving the cDNA sequence from biomaRt
using the following code:
for (i in seq_len(nrow(exons))) {
seq <- getSequence(chromosome = exons$chr[i],
start = exons$start[i],
end = exons$stop[i],
seqType = "cdna",
type = "ensembl_exon_id",
mart = ensembl)
}
However, each call to getSequence
returns more than one sequence and each sequence appears to be associated with multiple exons. Additionally, it requires that biomaRt use the same Ensembl version as my DEXSeq analysis which seems like it will hinder reproducibility.
There must be a simple way to retrieve the DNA/protein sequences corresponding to the genomic coordinates of the DEXSeq exons in Ensembl 89. What am I missing?
Did you manage to solve this ? I have similar problem..