Does anyone know why I would get such a wacky-looking insert size metrics histogram?
These are bisulfite reads mapped with Bismark (using Bowtie2); the insert size metrics were then obtained with Picard's CollectInsertSizeMetrics tool.
You have 2x150bp reads that are mostly getting trimmed or mapped incorrectly when the insert size is shorter than read length (and thus adapter contamination is present). You can get a more accurate insert size histogram with BBMerge, which does not do alignment and is relatively unaffected by adapters, but it requires the reads to overlap so it will only go up to around 290bp.
I recommend that you redo adapter-trimming with BBDuk and run mapping again, then look at the insert size histogram.
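To make the arithmetic behind this concrete, here is a minimal sketch. It assumes equal-length reads and a hypothetical minimum usable overlap of 10bp, which is only chosen to match the "around 290bp" figure and is not a BBMerge setting. The function names are made up for illustration.

```python
def adapter_bases(read_len: int, insert_size: int) -> int:
    """Adapter bases sequenced at the 3' end of each read.

    Zero when the insert is at least as long as the read.
    """
    return max(0, read_len - insert_size)


def insert_from_overlap(read_len: int, overlap: int) -> int:
    """Insert size implied by two equal-length reads overlapping by `overlap` bases.

    Only meaningful for inserts of at least one read length.
    """
    return 2 * read_len - overlap


READ_LEN = 150     # 2x150bp, as in this thread
MIN_OVERLAP = 10   # assumed for illustration, not a BBMerge parameter

print(adapter_bases(READ_LEN, 120))                # 30 adapter bases per read
print(adapter_bases(READ_LEN, 200))                # 0, insert longer than read
print(insert_from_overlap(READ_LEN, MIN_OVERLAP))  # 290, largest mergeable insert
```

So any pair whose insert is below 150bp carries adapter sequence in both reads, and an overlap-based method cannot measure inserts much beyond 290bp.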
I know that we have been having trouble with Bismark lately. I think it may refuse to mark reads that start at the same position as proper pairs, due to some bug. In this case, that means no reads that had adapters trimmed are marked as proper pairs. I'll suggest we try BSMAP instead; it looks a lot better. What our methylation people are currently doing is trimming 2bp from the right end of all reads (with BBDuk, that's the "ftr2=2" flag) so that they no longer fully overlap and Bismark correctly marks them as proper pairs. But both the trough in your graph and the blue tail indicate incomplete adapter trimming, even if the reads pass FastQC. If all adapters were correctly trimmed, then I think (not sure) that Bismark would report zero pairs with inserts shorter than read length, due to the bug. And the trough indicates that reads with only ~1-10bp of adapter are not getting trimmed; this may be too short to be detected by FastQC.
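Here is a toy model of why trimming 2bp from the right end helps. It is only a sketch under simple assumptions: the fragment sits at reference position 0, adapters are trimmed perfectly, and both reads align end to end. The function names are hypothetical, and this does not reproduce Bismark's actual pairing logic.

```python
def aligned_intervals(insert_size: int, read_len: int, trim_right: int = 0):
    """Half-open reference intervals (forward read, reverse read) for one pair.

    The right (3') end of the reverse read is its leftmost reference base,
    so right-end trimming moves the reverse read's start to the right.
    """
    n = min(read_len, insert_size) - trim_right
    if n <= 0:
        raise ValueError("nothing left of the read after trimming")
    forward = (0, n)
    reverse = (insert_size - n, insert_size)
    return forward, reverse


def same_start(insert_size: int, read_len: int, trim_right: int = 0) -> bool:
    """True when both reads of the pair start at the same reference position."""
    forward, reverse = aligned_intervals(insert_size, read_len, trim_right)
    return forward[0] == reverse[0]


print(aligned_intervals(100, 150))      # ((0, 100), (0, 100)), fully overlapping
print(aligned_intervals(100, 150, 2))   # ((0, 98), (2, 100)), starts now differ
print(same_start(100, 150))             # True
print(same_start(100, 150, 2))          # False
```

In this model every adapter-trimmed pair (insert at or below read length) has both reads starting at the same position, and removing 2bp from the right end of each read shifts the reverse read's start by 2.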
Also, the threads are much easier to navigate if you use "reply" for replies rather than "answer".
Interesting, we had seen similar oddities with Bismark; we found much higher proper-pair mapping rates with bwa-meth, which seems related to this.
What worries me about BSMAP is this: https://postimg.org/image/4j92iaf4f/
As you can see, a lot of Cs do not have any C vs T data for some reason in the BSMAP BAM file (bottom) compared to the Bismark BAM file (above).
None of these end up in the BSMAP methylation results file - it seems to just skip analysis of them.
Furthermore, I often get two side-by-side Cs that have very similar coverage (let's say 12 reads each) and a consistent number of Cs and Ts called for each of them, yet only one of them ends up in the methylation results file.
This is why I am exploring Bismark.
Also, is it not appropriate to just filter out the RF oriented pairs? I would still get more properly oriented pairs remaining than when I do the same with BSMAP.
No, it's not appropriate to filter RF-oriented pairs, because they are not really RF-oriented. They are normal innie pairs with insert size shorter than read length, which causes one of the tools (Picard, I suppose) to erroneously label them as RF. If you dump them, you will get bias against fragments that for whatever reason are shorter than normal, which for this library constitutes a lot of your data.
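A small sketch of that bias, using made-up insert sizes (toy values, not taken from this library) and assuming that the pairs labeled RF are exactly those with inserts shorter than the read length:

```python
from statistics import mean


def drop_short_pairs(insert_sizes, read_len):
    """Keep only pairs whose insert is at least one read length.

    Stand-in for "filter out the RF-labeled pairs" under the assumption above.
    """
    return [size for size in insert_sizes if size >= read_len]


inserts = [60, 90, 120, 140, 180, 220, 260]  # hypothetical toy data
kept = drop_short_pairs(inserts, read_len=150)

lost_fraction = 1 - len(kept) / len(inserts)
print(kept)                          # [180, 220, 260]
print(f"lost: {lost_fraction:.0%}")  # share of pairs discarded
print(mean(inserts), mean(kept))     # mean insert size shifts upward
```

Every short fragment disappears, so the surviving data over-represents long fragments.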