Hello, I would like help understanding the results below:
What is "%unambiguousReads" a fraction of? Essentially, how is the fraction calculated, and what determines the denominator? In addition, if the value is 0.87, does this mean 87% or 0.87%?
What is "%ambiguousReads" a fraction of? Essentially, how is the fraction calculated, and what determines the denominator?
What does assignedReads mean?
What does assignedBases mean?
What does the MB mean in unambiguousMB/ambiguousMB?
Since you are looking at two E. coli genomes, it is not surprising that the percentage of unambiguous reads is very small. No aligner can distinguish between very similar genomes of the same species, especially when short reads are used. I am curious where the remaining 95% of reads are, since these two lines do not seem to account for them.
I don't know what BBMap does specifically, but typically the denominator is the total number of reads, or the total number of mapped reads, depending on the circumstance.
In this case, it seems that the total number of reads was not reported in the statistics, hence we can't check that assumption.
I would expect that assigned means the read could be mapped (assigned to a location).
I would expect that unambiguous means that a read maps to a single location.
I would expect that ambiguous means that a read maps equally well to more than one location.
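To make the two possible denominators concrete, here is a minimal Python sketch. It is not how BBMap or bbsplit.sh computes its statistics (I have not checked their code); it only illustrates the definitions guessed at above. The read data and the names classify, percent, and best_hits are made up for the example.

```python
from collections import Counter

def classify(best_hit_count):
    """Label a read by how many locations tie for its best alignment."""
    if best_hit_count == 0:
        return "unmapped"
    if best_hit_count == 1:
        return "unambiguous"
    return "ambiguous"

def percent(count, denominator):
    """Return count as a percentage of denominator (0.0 if denominator is 0)."""
    return 100.0 * count / denominator if denominator else 0.0

# Hypothetical input: for each read, the number of equally good best locations.
best_hits = [1, 1, 2, 0, 3, 1, 0, 2, 1, 0]

counts = Counter(classify(n) for n in best_hits)
total_reads = len(best_hits)
mapped_reads = counts["unambiguous"] + counts["ambiguous"]

# Same numerator, two different denominators.
pct_unamb_of_total = percent(counts["unambiguous"], total_reads)
pct_unamb_of_mapped = percent(counts["unambiguous"], mapped_reads)

print(f"{pct_unamb_of_total:.2f}% of all reads")     # 40.00% of all reads
print(f"{pct_unamb_of_mapped:.2f}% of mapped reads")  # 57.14% of mapped reads
```

The same four unambiguous reads give 40% or about 57% depending on the denominator, which is why the total read count from the main output is needed to interpret the percentages.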
In fact, the output posted by the original poster is from the bbsplit.sh refstats option. So these results need to be considered together with the main output of the bbsplit.sh run, which looks like the bbmap.sh output I posted above (bbsplit.sh uses bbmap.sh under the covers to do the read binning). An example of that looks like: