Mystery: What is going on with this Insert Size Metrics Histogram?

Hi guys,

Does anyone know why I would get such a wacky-looking insert size metrics histogram? These are bisulfite reads mapped with Bismark (using Bowtie2); insert size metrics were then obtained with Picard's CollectInsertSizeMetrics tool.

Histogram: https://postimg.org/image/4oxcuo23t/
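For reference, the metrics were collected with something along these lines (file names are placeholders, not my exact paths):

    # Picard insert-size metrics on the Bismark BAM
    java -jar picard.jar CollectInsertSizeMetrics \
        I=sample_bismark_bt2_pe.bam \
        O=insert_size_metrics.txt \
        H=insert_size_histogram.pdf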

Thanks!!!


You have 2x150bp reads that are mostly getting trimmed or mapped incorrectly when the insert size is shorter than read length (and thus adapter contamination is present). You can get a more accurate insert size histogram with BBMerge, which does not do alignment and is relatively unaffected by adapters, but it requires the reads to overlap so it will only go up to around 290bp.
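Something like this will give you that overlap-based histogram (read file names are placeholders):

    # Insert-size histogram from read-pair overlap, no alignment involved
    bbmerge.sh in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz ihist=insert_size_histogram.txt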

I recommend that you redo adapter-trimming with BBDuk and run mapping again, then look at the insert size histogram.
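A typical BBDuk adapter-trimming command looks roughly like this (file names are placeholders; adapters.fa is the adapter reference bundled with BBTools):

    # Adapter trimming with BBDuk; tbo/tpe use the pair overlap, which helps
    # catch short adapter remnants that k-mer matching alone can miss
    bbduk.sh in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz \
        out1=trimmed_R1.fastq.gz out2=trimmed_R2.fastq.gz \
        ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo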


Well, I have done adapter trimming with cutadapt. My FastQC report shows no overrepresented sequences (besides telomeric repeats) and no overrepresented k-mers.
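The trimming was roughly along these lines (the adapter sequence and file names below are generic placeholders, not my exact command):

    # Paired-end adapter trimming with cutadapt
    cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
        -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
        -m 20 reads_R1.fastq.gz reads_R2.fastq.gz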

Just wondering, how did you know they were 2x150 - what should the distribution look like?

Also, how do you know they are getting mapped incorrectly? The RF orientation? But in the background there seems to be a very similar FR distribution.


Oh, I guess the 150 is evident from that valley. But I did do adapter trimming, which causes some reads to be shorter than 150 bp.


Actually, here is the same dataset (same adapter trimming) but aligned with BSMAP instead of Bismark. https://postimg.org/image/blp22owt5/
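The BSMAP run was along these lines (reference, file names, and thread count are placeholders, not my exact command):

    # Paired-end bisulfite alignment with BSMAP
    bsmap -a trimmed_R1.fastq.gz -b trimmed_R2.fastq.gz \
        -d reference.fa -o sample_bsmap.bam -p 8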

Why is there such a drastic difference?

Thank you!!


I know that we have been having trouble with Bismark lately. I think it refuses to mark pairs whose reads start at the same position as proper pairs, due to some bug. In this case, that means no reads that had adapters trimmed get marked as proper pairs. I'll suggest we try BSMAP instead; it looks a lot better.

What our methylation people are currently doing is trimming 2 bp from the right end of all reads (with BBDuk, that's the "ftr2=2" flag) so that the reads no longer fully overlap and Bismark will correctly mark them as proper pairs.

But both the trough in your graph and the blue tail indicate incomplete adapter trimming, even if the reads pass FastQC. If all adapters were correctly trimmed, then I think (not sure) that Bismark would report zero pairs with inserts shorter than read length, due to the bug. And the trough indicates that reads with only ~1-10 bp of adapter are not getting trimmed; that may be too short for FastQC to detect.
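The ftr2=2 step mentioned above looks roughly like this (file names are placeholders):

    # Trim the last 2 bases from the right end of every read so that pairs
    # no longer fully overlap
    bbduk.sh in1=trimmed_R1.fastq.gz in2=trimmed_R2.fastq.gz \
        out1=ftr_R1.fastq.gz out2=ftr_R2.fastq.gz ftr2=2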

Also, the threads are much easier to navigate if you use "reply" for replies rather than "answer".


Interesting; we had seen similar oddities with Bismark, and we found much higher proper-pair mapping rates with bwa-meth, which seems related to this.
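Roughly what we run, for reference (paths are placeholders): index the reference once, then align.

    # One-time index of the reference for bwa-meth
    bwameth.py index reference.fa
    # Paired-end alignment; SAM is written to stdout
    bwameth.py --reference reference.fa trimmed_R1.fastq.gz trimmed_R2.fastq.gz > sample_bwameth.sam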


What worries me about BSMAP is this: https://postimg.org/image/4j92iaf4f/

As you can see, a lot of Cs have no C-vs-T data at all in the BSMAP BAM file (bottom) compared to the Bismark BAM file (top).

None of these end up in the BSMAP methylation results file - it seems to just skip analysis of them.

Furthermore, I often get two side-by-side Cs with very similar coverage (say, 12 reads each) and a consistent number of Cs and Ts called for each, yet only one of them ends up in the methylation results file.
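For context, a typical call to produce a BSMAP methylation results file with methratio.py looks roughly like this (paths are placeholders; a sketch, not my exact command):

    # Per-cytosine methylation ratios from the BSMAP alignment
    methratio.py -d reference.fa -o methratio_output.txt sample_bsmap.bam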

This is why I am exploring Bismark.


Also, is it not appropriate to just filter out the RF oriented pairs? I would still get more properly oriented pairs remaining than when I do the same with BSMAP.


No, it's not appropriate to filter RF-oriented pairs, because they are not really RF-oriented. They are normal innie (FR) pairs with insert size shorter than read length, which causes one of the tools (Picard, I suppose) to erroneously label them as RF. If you dump them, you will introduce a bias against fragments that, for whatever reason, are shorter than normal, and for this library that constitutes a lot of your data.
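If you want to check this, look at the TLEN field of the pairs directly instead of relying on Picard's FR/RF call; a quick tally like this will show the short inserts (BAM name is a placeholder):

    # Tally |TLEN| (column 9) for the first read of each mapped pair
    samtools view -f 0x40 -F 0x4 sample_bismark_bt2_pe.bam \
        | awk '$9 != 0 {print ($9 < 0 ? -$9 : $9)}' \
        | sort -n | uniq -c | less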
