Question

Mapping DEXseq exons back to genomic DNA sequences

1

Entering edit mode

8.1 years ago

htyu1lmzta ▴ 30

I have run a series of experiments using DEXSeq to identify differentially expressed exons in my RNA-seq data, and have produced a list of exons of interest:

> exons
                  feature chr     start      stop strand
1 ENSMUSG00000009927:E001   9  44407139  44407659      +
2 ENSMUSG00000022836:E034  16  35000273  35002420      +
3 ENSMUSG00000025745:E001   5  30118304  30119425      -
4 ENSMUSG00000028034:E001   3 152210422 152210472      +
5 ENSMUSG00000028656:E001   4 122859047 122860217      -
...

However, because DEXSeq requires a custom GFF file as input (prepared using the bundled script, dexseq_prepare_ annotation.py), differentially expressed exons are not identified using Ensembl identifiers (e.g., ENSMUSE00001401988) but rather using DEXSeq's internal numbering (e.g., E001, E034). Additionally, the exons output by DEXSeq do not always appear to correspond exactly to DEXSeq exons. Sometimes they represent only a portion of the full exon as annotated in Ensembl.

I am interested in analyzing some of the features of the protein sequences of these exons, e.g. InterPro domain content. I therefore matched DEXSeq gene/exon combinations (in the column feature in the example table above) to entries in the DEXSeq GFF file to identify the coordinates of each DEXSeq exon. However, I am having a very difficult time figuring out how to retrieve the protein sequences that correspond to these coordinates.

I tried making a TxDb object (from the GenomicFeatures package) to match DEXSeq GFF exons to the original Ensembl GTF exons, but this led to the discovery that DEXSeq exons do not always correspond exactly to complete Ensembl exons. I next tried making a GRanges object to use getSeq, but this appears to require a BSgenome object and there are no BSgenome packages for Ensembl. Finally, I tried retrieving the cDNA sequence from biomaRt using the following code:

for (i in seq_len(nrow(exons))) {
  seq <- getSequence(chromosome = exons$chr[i],
              start = exons$start[i],
              end = exons$stop[i],
              seqType = "cdna",
              type = "ensembl_exon_id",
              mart = ensembl) 
}

However, each call to getSequence returns more than one sequence and each sequence appears to be associated with multiple exons. Additionally, it requires that biomaRt use the same Ensembl version as my DEXSeq analysis which seems like it will hinder reproducibility.

There must be a simple way to retrieve the DNA/protein sequences corresponding to the genomic coordinates of the DEXSeq exons in Ensembl 89. What am I missing?

R RNA-Seq rna-seq sequencing • 2.6k views

ADD COMMENT • link updated 2.9 years ago by osiemen ▴ 30 • written 8.1 years ago by htyu1lmzta ▴ 30

0

Entering edit mode

Did you manage to solve this ? I have similar problem..

ADD REPLY • link 2.9 years ago by osiemen ▴ 30

score 0 · Answer 1 · 2017-08-02

The output of dexseq_prepare_ annotation.py will have the transcript and gene information, to which each exon belongs to. So you can pull the gene name or transcript name of each differential exons and get their fasta sequences. Infact, in your output, you have the gene information. If you want to know the transctipt information, you need to check the flattened file from dexseq_prepare_ annotation.py