match() function returning NA even when there is a match
7.2 years ago

I have a data frame with gene names like this:

> test
                                 A
1                  mmu-miR-181a-5p
2                  mmu-miR-181b-5p
3 mmu-miR-199a-3p__mmu-miR-199b-3p
4 mmu-miR-669o-3p__mmu-miR-669a-3p
5                  mmu-miR-669d-5p
6                   mmu-miR-103-3p

I truncate the names as follows, to be able to match them with miRBase IDs:

> test$A <- gsub( "-3p*$", "", test$A)
> test$A <- gsub( "-5p*$", "", test$A)
> test
                              A
1                  mmu-miR-181a
2                  mmu-miR-181b
3 mmu-miR-199a-3p__mmu-miR-199b
4 mmu-miR-669o-3p__mmu-miR-669a
5                  mmu-miR-669d
6                   mmu-miR-103

Now I would like to use biomaRt to find the Ensembl IDs for the genes, but match() fails to find anything:

> ensembl = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
> genemap <- getBM( attributes = c("ensembl_gene_id", "gene_biotype", "external_gene_name","mirbase_id" ,"mirbase_trans_name"),
+                   mart = ensembl )
> idx <- match(test$A, genemap$mirbase_id )
> idx
[1] NA NA NA NA NA NA

Out of this list, mmu-mir-669d should give a match, but it doesn't. This is just an example: from the complete list I got about 16 matches, while I was expecting hundreds.

I was thinking of spaces generated by the gsub function, but there are no spaces. It's likely a stupid error, but where? Any educated guesses will be welcome...

7.2 years ago

Hey,

The match function looks for exact, case-sensitive matches, which, in this scenario, is a good thing because gene annotation can be difficult and frustrating, with vagueness and ambiguity between different naming systems.
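To see the exact-match behaviour in isolation, here is a minimal sketch in base R. The vector `ids` is a hypothetical stand-in for `genemap$mirbase_id`; it holds only three values copied from the output further down, not the real annotation table.

```r
# Hypothetical stand-in for genemap$mirbase_id (values copied from the output below)
ids <- c("mmu-mir-103-1", "mmu-mir-199b", "mmu-mir-669d")

# Truncated name from the question, still with a capital R
query <- "mmu-miR-669d"

# Exact comparison: "miR" and "mir" differ by case, so this is NA
exact <- match(query, ids)

# Folding case on both sides makes the two strings identical
folded <- match(tolower(query), tolower(ids))

# A case-insensitive grep is a quick way to spot near misses like this one
near <- grep(query, ids, ignore.case = TRUE, value = TRUE)
```

If `exact` is NA while `near` is not empty, the problem is in the spelling of the query rather than in the lookup table.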

The only issue that you are facing is with the names of the miRNAs for which you are searching. I was able to identify each of your miRNAs in the test data-frame using the following code:

test <- data.frame(A = c("mmu-miR-181a-5p", "mmu-miR-181b-5p",
                         "mmu-miR-199a-3p__mmu-miR-199b-3p",
                         "mmu-miR-669o-3p__mmu-miR-669a-3p",
                         "mmu-miR-669d-5p", "mmu-miR-103-3p"),
                   stringsAsFactors = FALSE)
# Strip the trailing arm suffix (-3p or -5p); "-3p*$" would also strip a bare "-3"
test$A <- sub("-[35]p$", "", test$A)

# Tidy the names so that they match the mirbase_id values
test$A <- gsub("R", "r", test$A)                             # miR -> mir
test$A <- gsub("mmu-mir-181a", "mmu-mir-181a-1", test$A)     # append -1
test$A <- gsub("mmu-mir-181b", "mmu-mir-181b-1", test$A)
test$A <- gsub("^mmu-mir-[0-9]*[a-z]-[35]p__", "", test$A)   # drop the first miRNA of a "__" pair
test$A <- gsub("mmu-mir-103", "mmu-mir-103-1", test$A)
test$A <- gsub("mmu-mir-669a", "mmu-mir-669a-1", test$A)

library(biomaRt)
ensembl <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
# Query only the names of interest instead of downloading the whole table
matches <- getBM(mart = ensembl,
                 attributes = c("ensembl_gene_id", "gene_biotype", "external_gene_name", "mirbase_id"),
                 filters = "mirbase_id", values = test$A, uniqueRows = TRUE)
matches

     ensembl_gene_id gene_biotype external_gene_name     mirbase_id
1 ENSMUSG00000065553        miRNA           Mir103-1  mmu-mir-103-1
2 ENSMUSG00000065565        miRNA          Mir181a-1 mmu-mir-181a-1
3 ENSMUSG00000065458        miRNA          Mir181b-1 mmu-mir-181b-1
4 ENSMUSG00000092807        miRNA            Mir199b   mmu-mir-199b
5 ENSMUSG00000096583        miRNA          Mir669a-1 mmu-mir-669a-1
6 ENSMUSG00000095699        miRNA            Gm26092 mmu-mir-669a-1
7 ENSMUSG00000077834        miRNA            Mir669d   mmu-mir-669d

You can see that I first tidy up the names of your miRNAs in the second block of my code. For example, neither match nor getBM will ever find a match for a lookup term like mmu-miR-199a-3p__mmu-miR-199b or mmu-miR-669o-3p__mmu-miR-669a, because each of these joins two miRNA names into one string. In this example, I have simply dropped the first miRNA in these two lookup terms and kept only the miRNA after the '__'. For the other miRNAs, I searched for them HERE to see what the official term could be.
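If you would rather keep both miRNAs from a joined name instead of dropping the first one, something along these lines could work. It is only a sketch: `to_precursor` is a hypothetical helper name, and it assumes that every joined entry uses '__' as the separator and that lower-casing "miR" is the only case fix needed.

```r
# Two names as they appear in the question, one of them a "__" pair
raw <- c("mmu-miR-181a-5p", "mmu-miR-199a-3p__mmu-miR-199b-3p")

# Hypothetical helper: split joined entries on "__", strip the arm suffix,
# and lower-case the "miR" part of each name
to_precursor <- function(x) {
  parts <- unlist(strsplit(x, "__", fixed = TRUE))
  parts <- sub("-[35]p$", "", parts)
  sub("miR", "mir", parts, fixed = TRUE)
}

candidates <- to_precursor(raw)
print(candidates)
```

Note that this flattens the input, so the link back to the original row is lost, and some of the resulting names may still need a suffix such as -1 before they match a mirbase_id.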

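You can also see that I used the getBM function differently here: rather than downloading the full annotation table and calling match on it, I passed the miRNA names through the mirbase_id filter and the values argument, so that only the matching records are returned.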
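One practical consequence shows up in rows 5 and 6 of the output above, where two Ensembl genes share the same mirbase_id. A local match call returns only the first hit per query, so one of them would be missed. The sketch below illustrates this offline with a toy `genemap` built from three rows of that output; the object names are mine and it is not the full biomaRt table.

```r
# Toy stand-in for the annotation table, using three rows from the output above
genemap <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000096583", "ENSMUSG00000095699", "ENSMUSG00000077834"),
  mirbase_id      = c("mmu-mir-669a-1", "mmu-mir-669a-1", "mmu-mir-669d"),
  stringsAsFactors = FALSE
)
wanted <- c("mmu-mir-669a-1", "mmu-mir-669d")

# match() gives one index per query, so only the first hit for mmu-mir-669a-1 is kept
first_hits <- genemap[match(wanted, genemap$mirbase_id), ]

# %in% keeps every annotation row whose ID is in the query set
all_hits <- genemap[genemap$mirbase_id %in% wanted, ]
```

If you do download the whole table, subsetting with `%in%` (or using merge) is therefore a safer choice than match when one ID can map to several genes.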


Thanks a bunch for the comprehensive reply and for teaching me some useful regular expressions. What a shame, I was (apparently) too tired to notice that there was a capital R in the query :)


No problem! I cannot be sure about the case sensitivity of the getBM function, but the other parts that I modified in the micro-RNA names are important!
