10.2 years ago
kvince.888
I'm trying to generate a data frame containing only Homo sapiens protein-coding genes, using the following code:
library(biomaRt)
# Connect to the Ensembl mart and select the human gene dataset
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# Restrict the query to genes whose biotype is protein_coding
filterlist <- list("protein_coding")
protcoding.genes.with.id <- getBM(attributes = c("hgnc_symbol", "entrezgene", "external_gene_id"), filters = "biotype", values = filterlist, mart = ensembl)
I get a data frame with 22710 observations. However, I've found that snoRNA Entrez Gene IDs are also included, listed under the protein-coding gene whose locus they share. For example:
quest <- protcoding.genes.with.id[grep("^GNL3", protcoding.genes.with.id$external_gene_id, ignore.case = TRUE), ]
View(quest)
  row.names hgnc_symbol entrezgene external_gene_id
1     12364       GNL3L      54552            GNL3L
2     15208        GNL3  100113381             GNL3
3     15209        GNL3      26354             GNL3
Row 2, with Entrez Gene ID 100113381, is actually SNORD19B, which maps within one of GNL3's introns.
How can I get only the actual protein-coding genes and no snoRNAs?
Thanks!
Interesting; returning the attribute gene_biotype gives values of protein_coding for all the results. Maybe it's a quirk of how gene IDs are mapped? I'm sure Emily can explain.
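Since the biotype filter alone doesn't separate these rows, one possible workaround is to drop rows whose Entrez Gene ID is known to belong to a snoRNA. Here is a minimal sketch of that idea. It runs offline on the three rows from the question. The vector snorna.entrez and the result name protcoding.only are hypothetical and are filled in by hand here; in practice the ID list might come from a second getBM query filtered on the snoRNA biotype, but that is an assumption I haven't tested.

```r
# Rows copied from the output shown in the question
quest <- data.frame(
  hgnc_symbol = c("GNL3L", "GNL3", "GNL3"),
  entrezgene = c(54552, 100113381, 26354),
  external_gene_id = c("GNL3L", "GNL3", "GNL3"),
  stringsAsFactors = FALSE
)

# Assumption: Entrez Gene IDs known to be snoRNAs (hypothetical, hand-filled).
# 100113381 is the SNORD19B ID mentioned in the question.
snorna.entrez <- c(100113381)

# Keep only rows whose Entrez Gene ID is not in the snoRNA list
protcoding.only <- quest[!(quest$entrezgene %in% snorna.entrez), ]
print(protcoding.only)
```

This keeps the GNL3L row and the GNL3 row with Entrez Gene ID 26354, and drops the SNORD19B row. Rows with a missing entrezgene are kept too, because NA never matches the exclusion list.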