Reading EMBL-like miRBase data into R

4.9 years ago

atakanekiz ▴ 310

Hello,

I'm trying to use miRNA data from the miRBase database in some of my R pipelines. The file I'm interested in is mirna.dat, which contains information on all published miRNAs (across multiple species).

One entry in the data file looks like this (output of readLines()):

[1] "ID   cel-let-7         standard; RNA; CEL; 99 BP."                               
  [2] "XX"                                                                              
  [3] "AC   MI0000001;"                                                                 
  [4] "XX"                                                                              
  [5] "DE   Caenorhabditis elegans let-7 stem-loop"                                     
  [6] "XX"                                                                              
  [7] "RN   [1]"                                                                        
  [8] "RX   PUBMED; 11679671."                                                          
  [9] "RA   Lau NC, Lim LP, Weinstein EG, Bartel DP;"                                   
 [10] "RT   \"An abundant class of tiny RNAs with probable regulatory roles in"         
 [11] "RT   Caenorhabditis elegans\";"                                                  
 [12] "RL   Science. 294:858-862(2001)."                                                
 [13] "XX"                                                                              
 [14] "RN   [2]"                                                                        
 [15] "RX   PUBMED; 12672692."                                                          
 [16] "RA   Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB," 
 [17] "RA   Bartel DP;"                                                                 
 [18] "RT   \"The microRNAs of Caenorhabditis elegans\";"                               
 [19] "RL   Genes Dev. 17:991-1008(2003)."                                              
 [20] "XX"                                                                              
 [21] "RN   [3]"                                                                        
 [22] "RX   PUBMED; 12747828."                                                          
 [23] "RA   Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D;"                       
 [24] "RT   \"MicroRNAs and other tiny endogenous RNAs in C. elegans\";"                
 [25] "RL   Curr Biol. 13:807-818(2003)."                                               
 [26] "XX"                                                                              
 ...                                           
 [57] "XX"                                                                              
 [58] "CC   let-7 is found on chromosome X in Caenorhabditis elegans [1] and pairs to"  
 [59] "CC   sites within the 3' untranslated region (UTR) of target mRNAs, specifying"  
 [60] "CC   the translational repression of these mRNAs and triggering the transition"  
 [61] "CC   to late-larval and adult stages [2]."                                       
 [62] "XX"                                                                              
 [63] "FH   Key             Location/Qualifiers"                                        
 [64] "FH"                                                                              
 [65] "FT   miRNA           17..38"                                                     
 [66] "FT                   /accession=\"MIMAT0000001\""                                
 [67] "FT                   /product=\"cel-let-7-5p\""                                  
 [68] "FT                   /evidence=experimental"                                     
 [69] "FT                   /experiment=\"cloned [1-3], Northern [1], PCR [4], 454 [5],"
 [70] "FT                   Illumina [6], CLIPseq [7]\""                                
 [71] "FT   miRNA           60..81"                                                     
 [72] "FT                   /accession=\"MIMAT0015091\""                                
 [73] "FT                   /product=\"cel-let-7-3p\""                                  
 [74] "FT                   /evidence=experimental"                                     
 [75] "FT                   /experiment=\"CLIPseq [7]\""                                
 [76] "XX"                                                                              
 [77] "SQ   Sequence 99 BP; 26 A; 19 C; 24 G; 0 T; 30 other;"                           
 [78] "     uacacugugg auccggugag guaguagguu guauaguuug gaauauuacc accggugaac        60"
 [79] "     uaugcaauuu ucuaccuuac cggagacaga acucuucga                               99"
 [80] "//"
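To make the layout concrete: every line starts with a two-letter code, `XX` lines act as spacers, and `//` closes an entry. A minimal base R sketch of splitting such lines into per-entry chunks follows. It assumes that `//` and `XX` appear exactly as shown; `split_entries` is a hypothetical helper name, and the second sample entry is a made-up placeholder, not real miRBase data.

```r
# Abbreviated sample: the first entry is shortened from the record above;
# the second entry is a made-up placeholder used only to show the splitting.
lines <- c(
  "ID   cel-let-7         standard; RNA; CEL; 99 BP.",
  "XX",
  "AC   MI0000001;",
  "XX",
  "DE   Caenorhabditis elegans let-7 stem-loop",
  "//",
  "ID   toy-mir-1         standard; RNA; TOY; 10 BP.",
  "XX",
  "AC   MI9999999;",
  "//"
)

# Hypothetical helper: group lines into entries, using "//" as the terminator.
split_entries <- function(lines) {
  is_end <- lines == "//"
  # Assign each line the index of the entry it belongs to.
  entry_id <- cumsum(c(FALSE, is_end[-length(is_end)])) + 1
  # Drop the terminators and the "XX" spacer lines.
  keep <- !is_end & lines != "XX"
  unname(split(lines[keep], entry_id[keep]))
}

entries <- split_entries(lines)
length(entries)              # number of entries found
substr(entries[[1]], 1, 2)   # two-letter line codes of the first entry
```

On the real file, `lines` would come from `readLines()` on the downloaded data file instead of the inline vector.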

The file follows an EMBL-like flat-file format, which doesn't play nicely with R's base read functions. I tried the gbRecord EMBL parser from the biofiles package, but it threw an error saying that mandatory fields were not found. I suspect that, although the miRBase data resembles EMBL, it is not structured identically, and that this causes the failure. Do you have any recommendations for dealing with this type of data?
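For illustration, the kind of result being sought can be sketched with base R string functions, bypassing an EMBL parser entirely. This is only a rough sketch under stated assumptions: `field` and `parse_entry` are hypothetical helper names, the sample is an abbreviated copy of the entry above, each entry is assumed to have a single `ID` and `AC` line, and sequence lines are assumed to be the only lines that start with a space.

```r
# Abbreviated copy of the entry shown above (references and comments omitted).
entry <- c(
  "ID   cel-let-7         standard; RNA; CEL; 99 BP.",
  "AC   MI0000001;",
  "DE   Caenorhabditis elegans let-7 stem-loop",
  "FT   miRNA           17..38",
  "FT                   /accession=\"MIMAT0000001\"",
  "FT                   /product=\"cel-let-7-5p\"",
  "FT   miRNA           60..81",
  "FT                   /accession=\"MIMAT0015091\"",
  "FT                   /product=\"cel-let-7-3p\"",
  "SQ   Sequence 99 BP; 26 A; 19 C; 24 G; 0 T; 30 other;",
  "     uacacugugg auccggugag guaguagguu guauaguuug gaauauuacc accggugaac        60",
  "     uaugcaauuu ucuaccuuac cggagacaga acucuucga                               99"
)

# Hypothetical helper: text of all lines carrying a given two-letter code.
field <- function(entry, code) {
  hits <- entry[substr(entry, 1, 2) == code]
  trimws(substring(hits, 6))
}

# Hypothetical helper: pull a few fields out of one entry.
parse_entry <- function(entry) {
  ft <- field(entry, "FT")
  # Assumption: only sequence lines begin with a space.
  seq_lines <- entry[grepl("^ ", entry)]
  list(
    id          = sub(" .*$", "", field(entry, "ID")),
    accession   = sub(";$", "", field(entry, "AC")),
    description = paste(field(entry, "DE"), collapse = " "),
    mature_acc  = sub("^/accession=\"(.*)\"$", "\\1",
                      grep("^/accession=", ft, value = TRUE)),
    mature_name = sub("^/product=\"(.*)\"$", "\\1",
                      grep("^/product=", ft, value = TRUE)),
    # Remove blanks and the trailing position counters from the sequence.
    sequence    = gsub("[^A-Za-z]", "", paste(seq_lines, collapse = ""))
  )
}

res <- parse_entry(entry)
str(res)
```

Qualifiers that wrap over several lines, such as `/experiment`, are not handled here and would need their continuation lines joined first.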

Best regards, Atakan

embl mirdb R parse
