Hey guys,
I've done differential expression analysis on an RNA-seq experiment using DESeq2 and I now want to add my BioMart annotations to my DESeq2 results table. I've selected my BioMart database and generated an annotation using:
library(biomaRt)
# Connect to Ensembl and pick the mouse gene dataset
ensembl <- useMart("ensembl")
datasets <- listDatasets(ensembl)
head(datasets)
ensembl <- useDataset("mmusculus_gene_ensembl", mart = ensembl)
# Attributes to retrieve (note: some of these are transcript-level)
attributeNames <- c("ensembl_gene_id", "entrezgene_id", "external_gene_name", "description", "transcript_biotype", "chromosome_name", "start_position", "end_position", "strand", "transcript_length")
# Query by the Ensembl gene IDs used as row names in the DESeq2 results
ourFilterType <- "ensembl_gene_id"
filterValues <- rownames(results_table)
full_mm10_annot <- getBM(attributes = attributeNames,
                         filters = ourFilterType,
                         values = filterValues,
                         mart = ensembl)
I then pulled out all rows with a duplicated Ensembl ID:
library(dplyr)
# Keep only rows whose Ensembl gene ID occurs more than once
duplications_bioMart <- full_mm10_annot %>%
  add_count(ensembl_gene_id) %>%
  filter(n > 1)
From what I can tell, the ensembl duplications seem to be coming from transcript_biotype.
head(duplications_bioMart)
# A tibble: 6 x 11
  ensembl_gene_id  entrezgene_id external_gene_n… description transcript_biot… chromosome_name start_position end_position strand
  <chr>                    <int> <chr>            <chr>       <chr>            <chr>                    <int>        <int>  <int>
1 ENSMUSG0000002…         170755 Sgk3             serum/gluc… protein_coding   1                      9798107      9900845      1
2 ENSMUSG0000002…         170755 Sgk3             serum/gluc… nonsense_mediat… 1                      9798107      9900845      1
3 ENSMUSG0000002…         170755 Sgk3             serum/gluc… protein_coding   1                      9798107      9900845      1
4 ENSMUSG0000002…         170755 Sgk3             serum/gluc… retained_intron  1                      9798107      9900845      1
5 ENSMUSG0000002…         170755 Sgk3             serum/gluc… nonsense_mediat… 1                      9798107      9900845      1
6 ENSMUSG0000002…         170755 Sgk3             serum/gluc… nonsense_mediat… 1                      9798107      9900845      1
# … with 2 more variables: transcript_length <int>, n <int>
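A minimal sketch of why rows multiply like this, using a small made-up table rather than a real BioMart query: transcript-level attributes (here transcript_biotype and transcript_length) give one row per transcript, so dropping them and keeping distinct rows brings the table back to one row per gene. The gene IDs and lengths below are placeholders invented for illustration, and base R is used (an assumption, so the snippet runs without extra packages).

```r
# Toy stand-in for a getBM() result: one row per transcript (all values made up).
annot <- data.frame(
  ensembl_gene_id    = c("GENE_A", "GENE_A", "GENE_A", "GENE_B"),
  external_gene_name = c("a", "a", "a", "b"),
  transcript_biotype = c("protein_coding", "retained_intron",
                         "protein_coding", "lncRNA"),
  transcript_length  = c(2100L, 850L, 1900L, 640L),
  stringsAsFactors   = FALSE
)

# Gene IDs that appear on more than one row.
dup_ids <- unique(annot$ensembl_gene_id[duplicated(annot$ensembl_gene_id)])

# Keep only the gene-level columns, then drop the now-identical rows.
gene_cols  <- c("ensembl_gene_id", "external_gene_name")
gene_annot <- unique(annot[, gene_cols])

print(dup_ids)
print(nrow(gene_annot))
```

With real data the same idea would mean leaving the transcript-level attributes out of the getBM() call when only gene-level annotation is needed.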
What is the gold standard in RNA-seq analysis for dealing with these Ensembl ID duplicates? My first thought was to keep only the protein-coding transcript_biotype rows, but I'm not sure whether I'd be losing information (or gaining unnecessary information). What does one do in these kinds of scenarios?
Library prep: Illumina TruSeq, polyA selection.
Cheers!
Hi Mike,
Thank you for your reply. My counts and differential expression are both gene-based, and yes, the transcript-level details such as biotype are not needed in my analysis. I simply want to annotate my dataset and add Entrez IDs so I can run GSEA/KEGG analyses, which, from my understanding, need Entrez IDs.
In summary, I've changed my filters and now use the following:
I then proceeded to make a spreadsheet for those interested in this dataset, combining the differential expression analysis data and gene annotations etc:
However, at this point, I'm expecting my ensembl_gene_IDs to be completely unique. When I do:
I get 21323:8. But when I do:
I was expecting to get 21323:8, but I actually get 21263:8, suggesting that there are duplicates. I think I'm getting multiple Entrez IDs for the same gene, but I'm not entirely sure that's the case. What's the best way to filter my results? Should I concatenate multiple Entrez IDs, or should I just use one Entrez ID per gene?
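One way to check whether multiple Entrez IDs are behind the extra rows is to count the distinct Entrez IDs per Ensembl ID, and then try both ways of collapsing them. The sketch below is only an illustration under assumptions: the table is made up (the IDs are placeholders, not real genes), it uses base R, and which option is appropriate will depend on the downstream tool.

```r
# Toy annotation table: GENE_A maps to two Entrez IDs, GENE_C to none.
annot <- data.frame(
  ensembl_gene_id  = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  entrezgene_id    = c(101L, 102L, 201L, NA),
  stringsAsFactors = FALSE
)

# Group the Entrez IDs by Ensembl gene ID.
by_gene <- split(annot$entrezgene_id, annot$ensembl_gene_id)

# Genes with more than one distinct, non-missing Entrez ID.
n_entrez <- vapply(by_gene, function(x) length(unique(x[!is.na(x)])), integer(1))
multi    <- names(n_entrez)[n_entrez > 1]

# Option 1: concatenate all Entrez IDs of a gene into one string.
collapsed <- data.frame(
  ensembl_gene_id  = names(by_gene),
  entrezgene_ids   = unname(vapply(
    by_gene,
    function(x) paste(unique(x[!is.na(x)]), collapse = ";"),
    character(1)
  )),
  stringsAsFactors = FALSE
)

# Option 2: keep only the first row seen for each gene.
first_only <- annot[!duplicated(annot$ensembl_gene_id), ]

print(multi)
print(collapsed)
print(first_only)
```

Concatenating keeps every mapping but produces a string column that an enrichment tool may not accept directly, whereas keeping one row per gene gives a clean one-to-one table at the cost of discarding the other IDs.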