Hi Guys,
I have run a differential expression analysis on some RNA-seq data (reads were mapped to the genome) and I am satisfied with the results (checked briefly by looking for downregulation of my knockout gene and its associated interactome). I then tried to put together a spreadsheet combining my DESeq2 results with gene annotations from BioMart, so that people interested in this dataset can examine the results too.
This is what I did for the biomaRt annotation:
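library(biomaRt)
# Connect to the Ensembl mart first, then select the mouse gene dataset
ensembl <- useMart("ensembl")
ensembl <- useDataset("mmusculus_gene_ensembl", mart = ensembl)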
attributeNames <- c("ensembl_gene_id", "entrezgene_id", "external_gene_name", "description", "chromosome_name", "start_position", "end_position", "strand")
ourFilterType <- "ensembl_gene_id"
filterValues <- rownames(Day4_CONvsCRE)
full_mm10_annot_Day4_CONvsCRE <- getBM(attributes = attributeNames,
                                       filters = ourFilterType,
                                       values = filterValues,
                                       mart = ensembl)
I then merged the DESeq2 results with the annotations I wanted, renaming a few columns:
library(dplyr)   # left_join(), rename(), %>%
library(tibble)  # rownames_to_column()
library(readr)   # write_tsv()
newcolnames <- c("GeneID", "Entrez", "Symbol", "Description", "Chr", "Start", "End", "Strand")
colnames(full_mm10_annot_Day4_CONvsCRE) <- newcolnames
Day4_CONvsCRE_table <- as.data.frame(Day4_CONvsCRE) %>%
  rownames_to_column("GeneID") %>%
  left_join(full_mm10_annot_Day4_CONvsCRE, by = "GeneID") %>%
  # explicit namespace in case another loaded package masks rename()
  dplyr::rename(log2FC = log2FoldChange, FDR = padj)
write_tsv(Day4_CONvsCRE_table, "/mnt/data/BMOHAMED/Total_RNAseq/MDM2kd_seq/all_samples/differential_expression/Day4_CONvsCRE_Annotated.txt")
However, when I inspected the number of rows I had:
dim(full_mm10_annot_Day4_CONvsCRE)
I got
[1] 21323 8
and when I did:
length(unique(full_mm10_annot_Day4_CONvsCRE$GeneID))
I was expecting this to equal the number of rows, but I got 21263. I assume the reason is that I either got multiple Entrez IDs for the same gene, or have duplicate Ensembl gene IDs, or both. How do I solve this problem? I wanted the Entrez IDs because my next step is to run GSEA and a KEGG analysis and, from my (rudimentary) understanding, both require Entrez IDs. How do I deal with this many-to-one relationship?
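For anyone wanting to reproduce the check, here is a minimal sketch of how the duplicated IDs can be listed. It uses a small made-up table (`annot`, with placeholder IDs) in place of `full_mm10_annot_Day4_CONvsCRE`, so the names and values are assumptions for illustration only:

```r
# Toy stand-in for the annotation table; the IDs below are invented placeholders
annot <- data.frame(
  GeneID = c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_B", "ENSMUSG_C"),
  Entrez = c(101L, 202L, 203L, NA),
  stringsAsFactors = FALSE
)

# Number of extra rows: total rows minus the number of distinct Ensembl IDs
n_extra <- nrow(annot) - length(unique(annot$GeneID))

# Ensembl IDs that appear on more than one row
dup_ids <- unique(annot$GeneID[duplicated(annot$GeneID)])

# All rows for those IDs, to see which column differs between the copies
dup_rows <- annot[annot$GeneID %in% dup_ids, ]
print(dup_rows)
```

On the real table, the same three steps would show whether the repeated rows differ only in the Entrez column.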
Thanks in advance!
After further inspection, I do have Ensembl duplicates, but this is because I have multiple Entrez IDs for the same Ensembl ID. Should I concatenate the multiple Entrez IDs? Also, if I just kept one of the Entrez IDs and discarded the duplicates, would I lose data? I'm currently assuming that, since I mapped to the genome, it doesn't really matter for downstream analysis.
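To make the two options concrete, here is a rough sketch on a toy table (the `annot` data frame, its IDs, and the helper `collapse_ids` are hypothetical, not part of the real data). It only shows the mechanics of each choice; it does not settle which one is appropriate for GSEA or KEGG:

```r
# Toy annotation table; the IDs and symbols are invented placeholders
annot <- data.frame(
  GeneID = c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_B", "ENSMUSG_C"),
  Entrez = c("101", "202", "203", NA),
  Symbol = c("GeneA", "GeneB", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Option 1: keep only the first row seen for each Ensembl ID
annot_first <- annot[!duplicated(annot$GeneID), ]

# Option 2: concatenate all distinct Entrez IDs per Ensembl ID
# (hypothetical helper; returns NA when a gene has no Entrez ID at all)
collapse_ids <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
}
entrez_by_gene <- vapply(split(annot$Entrez, annot$GeneID),
                         collapse_ids, character(1))

# One row per Ensembl ID, with the Entrez column replaced by the joined string
annot_collapsed <- annot_first
annot_collapsed$Entrez <- unname(entrez_by_gene[annot_collapsed$GeneID])
print(annot_collapsed)
```

Either result has one row per Ensembl ID, so joining it onto the DESeq2 table with `left_join` would no longer add rows. Option 1 silently drops the additional Entrez IDs, while Option 2 keeps them but produces a string that would need splitting again before any tool that expects a single Entrez ID per gene.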
Thanks in advance