Following up on a question I asked last week (Biostar Question): to work out which arm is dominant in miRBase, I can either check the 'Previous ID' field for a *, which marks the non-dominant arm, or compare read counts, where the dominant arm is the one with more reads. I need to do this for many miRNAs, so checking them by hand one by one is not practical. I looked through the files in the download section of miRBase but could not find what I need. For example, this is the miR-373 entry in the miRNA.dat file:
----------
ID hsa-mir-373 standard; RNA; HSA; 69 BP.
AC MI0000781;
DE Homo sapiens miR-373 stem-loop
DR TARGETS:PICTAR-VERT; hsa-miR-373; hsa-miR-373.
DR TARGETS:PICTAR-VERT; hsa-miR-373*; hsa-miR-373*.
DR HGNC; 31787; MIR373.
DR ENTREZGENE; 442918; MIR373.
FH Key Location/Qualifiers
FH
FT miRNA 6..27
FT /accession="MIMAT0000725"
FT /product="hsa-miR-373-5p"
FT /evidence=experimental
FT /experiment="cloned [1]"
FT miRNA 44..66
FT /accession="MIMAT0000726"
FT /product="hsa-miR-373-3p"
FT /evidence=experimental
FT /experiment="cloned [1-2], Northern [1]"
SQ Sequence 69 BP; 10 A; 13 C; 22 G; 0 T; 24 other;
gggauacuca aaaugggggc gcuuuccuuu uugucuguac ugggaagugc uucgauuuug 60
ggguguccc 69
----------
I can see the stem-loop sequence and the coordinates of the -5p and -3p arms, but nothing about which arm is dominant.
The other downloadable files in miRBase are FASTA files and files describing differences from past releases, so I don't think they are useful.
Am I looking in the wrong place in miRBase, or should I look somewhere else for this information so that I can extract it for all miRNAs with a script?
I am using Python. If there isn't an easy way, I could probably work something out with a module like Beautiful Soup, but it seems very odd to me that there isn't a smarter way to do it.
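For reference, the arm annotations in an entry like the one above can be pulled out with plain Python, without any HTML scraping. The sketch below is only an illustration: the function names `parse_mature_arms` and `iter_records` are made up here, the parser splits on whitespace so it tolerates both the collapsed spacing shown above and the wider column layout of the real file, and it assumes records in the full miRNA.dat are terminated by a `//` line as in EMBL-style flat files. It recovers the names and coordinates of the arms only; it does not tell you which arm is dominant.

```python
import re

# The miR-373 entry from the question (trimmed to the lines the parser uses).
EXAMPLE_RECORD = '''
ID hsa-mir-373 standard; RNA; HSA; 69 BP.
AC MI0000781;
FT miRNA 6..27
FT /accession="MIMAT0000725"
FT /product="hsa-miR-373-5p"
FT /experiment="cloned [1]"
FT miRNA 44..66
FT /accession="MIMAT0000726"
FT /product="hsa-miR-373-3p"
'''

def parse_mature_arms(record_text):
    """Return (hairpin_id, arms) for one record.

    Each arm is a dict with 'start', 'end' and, when present,
    'accession' and 'product'.
    """
    hairpin_id = None
    arms = []
    for line in record_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ID" and len(fields) > 1:
            hairpin_id = fields[1]
        elif fields[0] == "FT" and len(fields) >= 3 and fields[1] == "miRNA":
            # A new mature-miRNA feature, e.g. "FT miRNA 6..27"
            m = re.fullmatch(r"(\d+)\.\.(\d+)", fields[2])
            if m:
                arms.append({"start": int(m.group(1)), "end": int(m.group(2))})
        elif fields[0] == "FT" and len(fields) >= 2 and arms:
            # Qualifier lines attach to the most recent miRNA feature.
            m = re.match(r'/(accession|product)="([^"]*)"', fields[1])
            if m:
                arms[-1][m.group(1)] = m.group(2)
    return hairpin_id, arms

def iter_records(text):
    """Yield one record at a time, assuming '//' terminator lines."""
    chunk = []
    for line in text.splitlines():
        if line.strip() == "//":
            if chunk:
                yield "\n".join(chunk)
            chunk = []
        else:
            chunk.append(line)
    if chunk:
        yield "\n".join(chunk)

hairpin, arms = parse_mature_arms(EXAMPLE_RECORD)
for arm in arms:
    print(hairpin, arm["product"], arm["start"], arm["end"])
```

For the whole file, something like `for rec in iter_records(open("miRNA.dat").read())` followed by `parse_mature_arms(rec)` would give a table of hairpins and their arms that can later be joined to read counts from another source.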
This may be intentional: I think miRBase isn't too keen on assigning star sequences (rather than 5p/3p), because they are sometimes wrong and the antisense can be more important anyway. It seems like you are trying to assign ambiguous aggregate hairpin counts to specific mature miRNAs, which isn't possible.
From my understanding, star sequences are outdated (which is why you look at the previous ID and not the current one), but they still work for finding the non-dominant arm. I'd rather use the read counts, which seem better, but I can't find either in the downloadable files on miRBase.
If a file only has hsa-miR-373 in its name, I need to work out whether it refers to the 3p or the 5p arm. In this case the 3p is the dominant one, so my assumption would be that the file refers to it, but I need to do this for many miRNAs and can't look them up one by one on miRBase.
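To make the rule concrete: if per-arm read counts can be obtained from somewhere (they are not in the downloadable files discussed here, so the input below is an assumption), resolving a bare name to its dominant arm is a small grouping step. This is a rough sketch with hypothetical helper names (`bare_name`, `pick_dominant`), and the counts in the usage example are made-up placeholders, not miRBase values.

```python
def bare_name(mature_name):
    """Strip a trailing -5p/-3p so both arms share one key."""
    for suffix in ("-5p", "-3p"):
        if mature_name.endswith(suffix):
            return mature_name[:-len(suffix)]
    return mature_name

def pick_dominant(read_counts):
    """Map each bare name to the arm with the most reads.

    read_counts: dict of mature miRNA name -> read count (user supplied).
    A tie between the top two arms is reported as None rather than guessed.
    """
    by_bare = {}
    for name, count in read_counts.items():
        by_bare.setdefault(bare_name(name), []).append((count, name))

    dominant = {}
    for bare, arms in by_bare.items():
        arms.sort(reverse=True)  # highest count first
        if len(arms) > 1 and arms[0][0] == arms[1][0]:
            dominant[bare] = None
        else:
            dominant[bare] = arms[0][1]
    return dominant

# Placeholder counts for illustration only (not real miRBase numbers).
example_counts = {"hsa-miR-373-5p": 10, "hsa-miR-373-3p": 200}
print(pick_dominant(example_counts))
```

Returning None on a tie is a deliberate choice here, so that ambiguous cases are flagged for manual checking instead of being assigned silently.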
I opened a ticket here. I agree this should be easier than it currently is.
Not exactly what I wanted, but thanks. I was trying to avoid using R since Python is easier, but it looks like R has better tools for some of this stuff.