I have a data frame with gene names like this:
> test
A
1 mmu-miR-181a-5p
2 mmu-miR-181b-5p
3 mmu-miR-199a-3p__mmu-miR-199b-3p
4 mmu-miR-669o-3p__mmu-miR-669a-3p
5 mmu-miR-669d-5p
6 mmu-miR-103-3p
I truncate the names as follows, to be able to match them with miRBase IDs:
> test$A <- gsub( "-3p*$", "", test$A)
> test$A <- gsub( "-5p*$", "", test$A)
> test
A
1 mmu-miR-181a
2 mmu-miR-181b
3 mmu-miR-199a-3p__mmu-miR-199b
4 mmu-miR-669o-3p__mmu-miR-669a
5 mmu-miR-669d
6 mmu-miR-103
Now I would like to use biomaRt to find the Ensembl IDs for the genes, but match() finds nothing:
> ensembl = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
> genemap <- getBM( attributes = c("ensembl_gene_id", "gene_biotype", "external_gene_name","mirbase_id" ,"mirbase_trans_name"),
+ mart = ensembl )
> idx <- match(test$A, genemap$mirbase_id )
> idx
[1] NA NA NA NA NA NA
Out of this list, mmu-mir-669d should give a match, but it doesn't. This is just an example: out of the complete list I got about 16 matches, while I was expecting hundreds.
I was thinking of spaces generated by the gsub function, but there are no spaces. It's likely a stupid error, but where? Any educated guesses will be welcome...
Thanks a bunch for a comprehensive reply and for teaching me useful regular expressions. What a shame, I was (apparently) too tired to notice that there was a capital R in the query :)
No problem! I cannot be sure about the case sensitivity of the getBM function, but the other parts that I modified in the micro-RNA names are important!
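For anyone landing here later, a minimal sketch of the kind of clean-up discussed in this thread, under a few assumptions: joined entries are split on "__", the arm suffix is removed with a single anchored pattern, and both sides are lowercased before matching. The `mirbase_id` vector is a hypothetical stand-in for `genemap$mirbase_id`, and `long`, `parts` and `key` are names made up for the example, so adapt it to your real getBM result.

```r
# Example names copied from the question
test <- data.frame(
  A = c("mmu-miR-181a-5p", "mmu-miR-181b-5p",
        "mmu-miR-199a-3p__mmu-miR-199b-3p",
        "mmu-miR-669o-3p__mmu-miR-669a-3p",
        "mmu-miR-669d-5p", "mmu-miR-103-3p"),
  stringsAsFactors = FALSE
)

# Split joined entries on the literal "__" and remember the source row
parts <- strsplit(test$A, "__", fixed = TRUE)
long <- data.frame(
  row    = rep(seq_along(parts), lengths(parts)),
  mature = unlist(parts),
  stringsAsFactors = FALSE
)

# "-[35]p$" removes only a trailing "-3p" or "-5p";
# tolower() turns "miR" into "mir" so case no longer blocks the match
long$key <- tolower(sub("-[35]p$", "", long$mature))

# Hypothetical stand-in for genemap$mirbase_id (assumption, not real output)
mirbase_id <- c("mmu-mir-669d", "mmu-mir-155")

# Compare lowercase against lowercase
long$idx <- match(long$key, tolower(mirbase_id))
long
```

The original pattern "-3p*$" also works on these names, but `p*` means "zero or more p", so it would strip a bare trailing "-3" as well; the anchored character class avoids that. Names that still return NA after this are worth checking by hand against the IDs in `genemap`.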