Hi folks,
I am trying to create a table for the distribution of mapped RNA sequenced reads across miRBase v.21 for each isomiRs.
I am using contents from this miRBase.mrd file to generate the reads mapped to isomiRs. I have copied the sample output from miRBase.mrd file generated using miRDeep2.
I have written a perl code to identify the isomirs using miRBase.mrd file.
As per the above output, some of the sequenced reads are present in both "ssc-let-7a-1" and "ssc-let-7c".
For example,
seq_3228221_x43
seq_3218517_x44
seq_2697760_x253
Note: seq_3228221 is ID and x43 refers to 43 times this sequence present in the FASTQ file.
I am aware that miRDeep2's mapper file uses following bowtie command "bowtie -p 1 -f -n 0 -e 80 -l 18 -a -m 5 --best --strata " to generate the alignment file. This command allows the same sequence to map to multiple miRNAs.
While generating the percentage of mapped reads for all miRNAs, do I need to consider only unique reads (seq_2697760_x253 only once) or reads with multiple matches then it becomes 2*253=506 reads?
For reference, this is the miRDeep2's final read count output file for all miRNAs
I think you should count every read only once. It may map to 100 miRNA, but it is still one read. (I understand you are talking about "sequences" and not reads. I assume that "sequences" are just clustered identical reads, thus the situation doesn't change). The best thing, however, would be to do a very small experiment, if you have time. Try to do the analysis with only 1 sequence (mapping in multiple positions). From there, you will easily understand what is the correct way of treating it.
You will need a higher coverage to validate isomiRs, really higher.