Hello,
I'm trying to use miRNA data from the miRBase database in some of my R pipelines. The data file I'm interested in is mirna.dat, which contains information on all published miRNAs (across multiple species).
One entry within the data file looks like this (output of the readLines() function):
[1] "ID cel-let-7 standard; RNA; CEL; 99 BP."
[2] "XX"
[3] "AC MI0000001;"
[4] "XX"
[5] "DE Caenorhabditis elegans let-7 stem-loop"
[6] "XX"
[7] "RN [1]"
[8] "RX PUBMED; 11679671."
[9] "RA Lau NC, Lim LP, Weinstein EG, Bartel DP;"
[10] "RT \"An abundant class of tiny RNAs with probable regulatory roles in"
[11] "RT Caenorhabditis elegans\";"
[12] "RL Science. 294:858-862(2001)."
[13] "XX"
[14] "RN [2]"
[15] "RX PUBMED; 12672692."
[16] "RA Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB,"
[17] "RA Bartel DP;"
[18] "RT \"The microRNAs of Caenorhabditis elegans\";"
[19] "RL Genes Dev. 17:991-1008(2003)."
[20] "XX"
[21] "RN [3]"
[22] "RX PUBMED; 12747828."
[23] "RA Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D;"
[24] "RT \"MicroRNAs and other tiny endogenous RNAs in C. elegans\";"
[25] "RL Curr Biol. 13:807-818(2003)."
[26] "XX"
...
[57] "XX"
[58] "CC let-7 is found on chromosome X in Caenorhabditis elegans [1] and pairs to"
[59] "CC sites within the 3' untranslated region (UTR) of target mRNAs, specifying"
[60] "CC the translational repression of these mRNAs and triggering the transition"
[61] "CC to late-larval and adult stages [2]."
[62] "XX"
[63] "FH Key Location/Qualifiers"
[64] "FH"
[65] "FT miRNA 17..38"
[66] "FT /accession=\"MIMAT0000001\""
[67] "FT /product=\"cel-let-7-5p\""
[68] "FT /evidence=experimental"
[69] "FT /experiment=\"cloned [1-3], Northern [1], PCR [4], 454 [5],"
[70] "FT Illumina [6], CLIPseq [7]\""
[71] "FT miRNA 60..81"
[72] "FT /accession=\"MIMAT0015091\""
[73] "FT /product=\"cel-let-7-3p\""
[74] "FT /evidence=experimental"
[75] "FT /experiment=\"CLIPseq [7]\""
[76] "XX"
[77] "SQ Sequence 99 BP; 26 A; 19 C; 24 G; 0 T; 30 other;"
[78] " uacacugugg auccggugag guaguagguu guauaguuug gaauauuacc accggugaac 60"
[79] " uaugcaauuu ucuaccuuac cggagacaga acucuucga 99"
[80] "//"
The data is formatted similarly to the EMBL flat-file structure, which doesn't play nicely with R's base read functions. I tried the gbRecord EMBL parser function from the biofiles package, but it threw an error saying that mandatory fields were not found. I think that, although the miRBase data is similar to EMBL, it is not structured identically, which causes the failure here. Do you have a recommendation for ways to deal with this type of data?
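For illustration, here is a minimal base R sketch of the kind of parsing I have in mind. It assumes every record ends with a "//" line, every field line starts with a two-letter code (ID, AC, DE, FT, ...), and sequence lines start with whitespace, as in the entry above. The function name parse_mirna_dat and the output columns are hypothetical, not part of miRBase or biofiles, and it only pulls out a handful of fields.

```r
# Hypothetical helper: parse EMBL-style records from a character vector of
# lines (e.g. the result of readLines("mirna.dat")) into a data frame.
parse_mirna_dat <- function(lines) {
  # A "//" line terminates a record; number the records by counting
  # how many terminators precede each line.
  is_end <- grepl("^//", lines)
  record_id <- cumsum(c(0L, head(is_end, -1L))) + 1L
  keep <- !is_end & nzchar(lines)
  records <- split(lines[keep], record_id[keep])

  rows <- lapply(records, function(rec) {
    code <- substr(rec, 1, 2)              # two-letter line code
    value <- trimws(substring(rec, 3))     # rest of the line
    field <- function(tag) paste(value[code == tag], collapse = " ")

    # Mature miRNA names are stored in FT lines as /product="..."
    products <- regmatches(
      rec, regexpr('(?<=/product=")[^"]+', rec, perl = TRUE)
    )
    # Sequence lines start with whitespace; drop spaces and position counts
    seq_lines <- rec[grepl("^\\s", rec)]

    data.frame(
      id = sub("\\s.*$", "", field("ID")),
      accession = sub(";.*$", "", field("AC")),
      description = field("DE"),
      mature = paste(products, collapse = ";"),
      sequence = gsub("[^A-Za-z]", "", paste(seq_lines, collapse = "")),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}
```

It would be called as entries <- parse_mirna_dat(readLines("mirna.dat")), giving one row per stem-loop entry. The reference (RN/RX/RA/RT/RL) and comment (CC) lines are ignored here, and the FT coordinates are not parsed, so this is a starting point rather than a full parser.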
Best regards, Atakan