The top part should be highly accurate in all cases; the rest is more data-dependent. In this case you have roughly 94% strandedness, but it's hard to say whether that's the plus or minus strand (for read 1). The shell script actually has a decent explanation inside - because this tool is more confusing than most, I tried to be fairly explicit instead of just listing the usage flags as usual - but I keep that concise, so I'll go into more detail here.
Block 1: Depth Analysis
Using BBSketch libraries, a sketch (essentially, a subset of kmers) is made of the library. This is a normal - or, I'll refer to it as canonical - sketch, meaning that every kmer is compared to its reverse complement, then whichever is greater (in 2-bit encoding) is used as the canonical version. That is hashed, and the hashcodes are stored, sorted, and the top X hashcodes (in this case, 20000 by default) plus their counts are retained, and that's the sketch. Concurrently, a second sketch is also made, just using the forward kmers. Now we have two arrays each of 20000 hashcodes and counts.
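To make the canonical-vs-forward distinction concrete, here is a minimal Python sketch of the idea (this is illustrative, not BBSketch's actual hash function or data structures; `toy_hash` and `make_sketches` are made-up names). Since A=0, C=1, G=2, T=3 in 2-bit encoding, and A<C<G<T alphabetically, the 2-bit comparison agrees with plain string comparison, so the canonical kmer is just the max of a kmer and its reverse complement:

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(kmer):
    return kmer.translate(COMP)[::-1]

def canonical(kmer):
    # A=0, C=1, G=2, T=3, so 2-bit comparison matches string comparison
    return max(kmer, revcomp(kmer))

def toy_hash(kmer):
    # Stand-in for BBSketch's real hash; any decent hash works for illustration
    h = 0
    for c in kmer:
        h = (h * 31 + ord(c)) & 0xFFFFFFFFFFFF
    return h

def make_sketches(seq, k=5, size=20000):
    canon, fwd = Counter(), Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        fwd[toy_hash(kmer)] += 1
        canon[toy_hash(canonical(kmer))] += 1
    # Retain 'size' hashcodes plus their counts, sorted by hashcode
    keep = lambda counts: sorted(counts.items())[:size]
    return keep(canon), keep(fwd)
```

So `canonical("AACCT")` returns `"AGGTT"`, matching the example below.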
These sketches are compared to each other by walking down the array and seeing if the hashcodes are equal. In randomly-stranded data you would expect roughly 50% of the forward codes to be present in the canonical array - and vice-versa - but it's a different 50%, since there are twice as many forward kmer hashcodes as canonical kmer hashcodes in the code space (since half of the forward kmers would be instead transformed into reverse kmers). For example, AACCT and AGGTT could both be present in the forward sketch, but only AGGTT would be present in the canonical sketch since those are reverse-complements and AGGTT is bigger. So anyway, as you walk down the arrays, you might see AACCT's hashcode in the forward array, and it would get ignored since it's not in the canonical array. If you see AGGTT in the forward array, it would also be present in the canonical array (that's mandatory). Though if you see AGGTT in the canonical array it might not be present in the forward array.
In this example, let's say AGGTT is in both arrays, with a count of 10 in the canonical array and 5 in the forward array. That's (almost) what we would expect from unstranded data - you see the kmer 5 times in the forward orientation and 5 in the reverse orientation. On the other hand, if we saw GGGCC in the canonical array but NOT in the forward array, that means we encountered it 10 times in the reverse orientation and 0 times in the forward orientation (note that we stop walking when we go off the end of either array, so some elements are ignored in one of the arrays).
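The walk itself is just a merge-style scan over the two hashcode-sorted arrays. A sketch of that logic (a hypothetical helper, not the actual BBSketch code), using the AGGTT/GGGCC example above with made-up hashcodes:

```python
def pair_counts(canon, fwd):
    """canon, fwd: lists of (hashcode, count) sorted by hashcode.
    Returns (canonical_count, forward_count) pairs for canonical entries."""
    pairs = []
    i = j = 0
    while i < len(canon) and j < len(fwd):  # stop at the end of either array
        (ch, cc), (fh, fc) = canon[i], fwd[j]
        if ch == fh:             # kmer seen in both orientations
            pairs.append((cc, fc)); i += 1; j += 1
        elif ch < fh:            # canonical-only: 0 forward observations
            pairs.append((cc, 0)); i += 1
        else:                    # forward-only (e.g. AACCT): ignored
            j += 1
    return pairs

# Say AGGTT hashes to 7 (count 10 canonical, 5 forward), GGGCC to 3
# (10 canonical, absent from forward), and AACCT to 5 (forward-only):
canon = [(3, 10), (7, 10)]
fwd = [(5, 5), (7, 5)]
pair_counts(canon, fwd)  # → [(10, 0), (10, 5)]
```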
In the above case, we have 2 pairs of counts: 10 and 5, and 10 and 0. In each case we select the lower of the two counts (5 and 0) and compare them to the average minor allele count of 10 coin flips, which is ~3.269368. Why is it not 5? Well... by minor allele count, I mean: if you flip a coin 10 times, take the lesser of the number of heads or tails, so the maximum possible is 5 (at 6 it would become the major allele). Therefore the minor allele count could be 0, 1, 2, 3, 4, or 5. Since each has a non-zero chance of occurring, the average minor allele count MUST be less than 5. I made a convenient table of these in bbmap/resources/minorAlleleCount.txt so that I don't have to recalculate them. That table is based on simulation, because using combinatorics is tricky when the numbers overflow doubles; if someone wants to post code for an exact answer using combinatorics, that would be great too.
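Taking up the combinatorics invitation: working in log space (via lgamma) sidesteps the double-overflow problem, since log C(n,k) - n*log(2) stays small even when C(n,k) itself would overflow. This is a sketch under the simple fair-coin model described above; the shipped simulation-based table may be built somewhat differently, so don't expect digit-for-digit agreement:

```python
import math

def expected_minor_count(n):
    # Exact E[min(heads, tails)] for n fair coin flips.
    # P(k heads) = C(n,k) / 2^n, computed in log space to avoid overflow.
    total = 0.0
    for k in range(n + 1):
        logp = (math.lgamma(n + 1) - math.lgamma(k + 1)
                - math.lgamma(n - k + 1) - n * math.log(2))
        total += math.exp(logp) * min(k, n - k)
    return total
```

For small n this is easy to check by hand: for n=2 the minor count is 1 only on the HT/TH outcomes, so the expectation is 0.5.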
Next we make a sum of the actual and expected minor allele counts. In this case we have 2 kmers, each with a total allele count of 10, and minor allele counts of 0 and 5, so the actual sum is 5+0=5 and the expected sum is 3.269368+3.269368=6.538736. At the end, we compare these two numbers, and get a result of:
strandedness=0.5+(1-(minSum/expectedMinSum))*0.5
=0.5+(1-(5/6.538736))*0.5=0.617663
...so the data is ~62% stranded (except that's not accurate because we only used 2 kmer counts).
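Putting the example numbers through the formula (the table value 3.269368 for a total count of 10 is quoted from the text above; `strandedness` here is a hypothetical helper, not the tool's actual code):

```python
def strandedness(pairs, expected_minor):
    # pairs: (canonical_count, forward_count) per kmer;
    # expected_minor: maps total count -> expected minor allele count
    min_sum = sum(min(f, c - f) for c, f in pairs)
    expected_sum = sum(expected_minor[c] for c, _ in pairs)
    return 0.5 + (1 - min_sum / expected_sum) * 0.5

table = {10: 3.269368}                   # from bbmap/resources/minorAlleleCount.txt
strandedness([(10, 5), (10, 0)], table)  # → ~0.617663, i.e. ~62% stranded
```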
On average we would expect the minSum to equal the expectedMinSum, so you end up with ~50% strandedness. In some rare or contrived cases minSum can be greater than expectedMinSum, so the result can go below 50%, but I use a different formula for that case. That corresponds to unlikely scenarios where you flip a coin 10 times and get a minor count of 5 every single time, indicating some kind of loaded dice that strictly alternate between heads and tails. You can get that result by running:
reformat.sh in=reads.fq out=rcomp.fq rcomp
cat reads.fq rcomp.fq > both.fq
...which ensures exactly the same number of forward and reverse kmers, and would thus be reported as 0% stranded. But a normal library will range between 50% and 100%.
So that's the program in a nutshell! But we just know the percent strandedness, not the dominant strand, which is why I added more tests. Also, I thought it would be neat to have a single tool which gleans absolutely all possible data regarding strandedness from a set of reads.
Block 2: Stop Codon Analysis
Here, you have 75bp reads, which are very short for this purpose. But theoretically, reads will tend to have a higher concentration of stop codons on the antisense strand than the sense strand, if you calculate it for all 3 frames on each strand and pick the longest. It depends on the organism, because the mRNAs might start some distance before the start codon or end after the stop codon, might even have another gene on the opposite strand, or could be a non-coding rRNA - so it's just a trend. But for metagenome reads I find it works OK with 150bp reads, and even better with merging turned on to make them longer. In fact, for all of these analyses, it's best to use R1 and R2 and add the "merge" flag; if they overlap, the merged read will be used, and otherwise only R1 will be used.
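A simplified sketch of this idea (function names are made up, and the real tool also tracks ORF lengths, not just stop counts): count stop codons in all 3 frames on each strand, take the best (fewest-stops) frame per strand, and expect the sense strand to win:

```python
STOPS = {"TAA", "TAG", "TGA"}
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def min_stops(seq):
    # Fewest stop codons over the 3 reading frames (proxy for the longest-ORF frame)
    return min(sum(seq[i:i + 3] in STOPS for i in range(f, len(seq) - 2, 3))
               for f in range(3))

def strand_vote(read):
    plus, minus = min_stops(read), min_stops(revcomp(read))
    if plus < minus:
        return "plus"   # fewer stops on plus: plus looks like the sense strand
    if minus < plus:
        return "minus"
    return "tie"
```

On 75bp reads most individual votes will be ties or noise; the signal only emerges when averaged over many reads, which is why the averages in the output matter more than any single read.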
In this case, the average ORF length of the longest frame on the plus and minus strands are similar. Also the average stop count is similar. And the average stop count is greater on minus while ORF length is also greater on minus, which is contradictory, so this is not very informative (though possibly filtering out rRNA would help in this analysis; I might add that capability). So even though the result is "minus", I'd ignore it since the numbers are so close and contradictory, plus the reads are so short. GC is mentioned here because it affects stop codon density; specifically, high-GC organisms won't have as many random stop codons in noncoding areas, which reduces the accuracy.
Block 3: Poly-A Tail Analysis
IF you did some kind of poly-A pulldown, AND you didn't poly-A trim the reads, this MIGHT be relevant, depending on the organism (and maybe in other cases too). PolyAFraction is mentioned because when it's low, this block is likely irrelevant (although in the metagenomic data I've looked at, it was accurate even when the number was low; I don't remember how low it was or how our metagenomic data was prepared, but it did match the other metrics for major-strand prediction). PolyAFraction is the fraction of reads that have a poly-A tail or a poly-T start. I'm using minPolyA=6 as the default cutoff for this expression:
if(trailingPolyA>leadingPolyT && trailingPolyA>minPolyA) {polyACount++;}
else if(trailingPolyA<leadingPolyT && leadingPolyT>minPolyA) {polyTCount++;}
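For completeness, here is one way trailingPolyA and leadingPolyT could be measured, feeding the expression above (hypothetical Python equivalents; the actual tool is Java and may count these differently):

```python
MIN_POLY_A = 6  # matches the minPolyA=6 default above

def trailing_poly_a(seq):
    # Length of the uninterrupted run of A's at the end of the read
    n = 0
    for base in reversed(seq):
        if base != "A":
            break
        n += 1
    return n

def leading_poly_t(seq):
    # Length of the uninterrupted run of T's at the start of the read
    n = 0
    for base in seq:
        if base != "T":
            break
        n += 1
    return n

def classify(seq):
    a, t = trailing_poly_a(seq), leading_poly_t(seq)
    if a > t and a > MIN_POLY_A:
        return "polyA"
    if t > a and t > MIN_POLY_A:
        return "polyT"
    return None  # ambiguous or no tail; not counted either way
```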
Anyway, as I said, it did work for the metagenomic data I tested it on. This currently only uses R1 (or the merged read), so it might work better if I used R2 to look for the trailing poly-A instead - which would be a poly-T in that case, of course.
Block 4: Read Gene-Calling Analysis
This should probably be ignored for eukaryotic data. It uses BBTools' prokaryotic gene-caller (callgenes.sh) to call genes on the plus and minus sides of the read and see which scores better. How well this works depends on how well the organism matches the model... and on read length, since gene calling is really hard on 75bp reads, but works a lot better on merged 2x150bp reads. Also, it works far better (and I mean FAR better) if you supply a reference fasta genome, which it can use (via the passes=2 flag) to refine the gene model before it tries to make calls on short reads. In that case it will also give you additional metrics, because it can do gene-calling on the reference too. Don't bother feeding it a reference unless you are working with a prokaryote or virus, OR you also supply a gff. Even with a gff, this section will (probably) be totally wrong for euks (though it might work for the organelles), but some other blocks that pop up will be relevant.