I have two sequencing runs:
44_Mm03_WEAd_C2_H3K4me3_F_1_ACAGTG_L007_merged.bam
, and 44_Mm03_WEAd_C2_Input_F_1_CCGTCC_L002_merged.bam
We'll call them Input and H3K4 for short.
I have the FASTQs (unmapped reads), the uBAMs (unmapped reads in BAM with adapters marked), and the BAMs (mapped reads).
The FASTQs of Input: 56,823,632 reads
H3K4: 28,798,582 reads
The uBAM of Input: 56,823,632 reads
H3K4: 28,798,582 reads
The BAM of Input: 56,824,033 reads
H3K4: 28,799,788 reads
So far so good. Here are the mapping rates:
Input:
56824033 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
56456641 + 0 mapped (99.35%:-nan%)
56824033 + 0 paired in sequencing
28412014 + 0 read1
28412019 + 0 read2
54529248 + 0 properly paired (95.96%:-nan%)
56318612 + 0 with itself and mate mapped
138029 + 0 singletons (0.24%:-nan%)
1702848 + 0 with mate mapped to a different chr
1278811 + 0 with mate mapped to a different chr (mapQ>=5)
H3K4:
28799788 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
26885619 + 0 mapped (93.35%:-nan%)
28799788 + 0 paired in sequencing
14399868 + 0 read1
14399920 + 0 read2
26577370 + 0 properly paired (92.28%:-nan%)
26709350 + 0 with itself and mate mapped
176269 + 0 singletons (0.61%:-nan%)
124608 + 0 with mate mapped to a different chr
100145 + 0 with mate mapped to a different chr (mapQ>=5)
Again both good, and seemingly well-mapped. But here's the weird thing: the mapping for Input took 1hr 15min 54s, while the H3K4 sample took just 15min 44s. 1/5th of the time. In fact so quick I'm concerned it didn't even work, despite what flagstat is telling me.
Is this common? Should I be worried?
All mapping is happening at the same time, on the same server, which is no where near half-load for CPU or RAM.
Are you using the cluster or have you copied these to a local disk?
Local disk, although its a small compute server (32cores, 64Gb ram) -- I'm trying to work out how long it might take on my local machine before moving it up to the cluster, and these widely-variable compute times make it tricky (although i just checked maximus and it has like 100 cores free and more bytes of ram than i can parse with my eyes :P - certainly would cut my mapping time to, like, 10% of the time if i used that! I just feel a bit weird using it because no one from Bioinfo. has ever given me their official blessing to use it, told me what constitutes as 'acceptable load', etc. So right now i'm on my own little island trying not to bother anyone :P
But really i'm just blown away by how variable mapping times take. I would have thought it was a straight forward function of number-of-reads. I guess Input is genome-wide and H3K4 is going to be many reads in the sameish locations....? Eh... I should really mark dupes and see if that correlates with mapping time. I wouldn't have thought so, but as today's seminar proves i'm frequently wrong! :)
EDIT: For a full list of the mapping times, stdout/etc, its all on http://log.bio -> view table -> Filters -> Date -> Today. The date-picker dropdown might not show up because I suck at everything web. Really looking forward to finishing this analysis off so i can finish off log.bio -_-
EDIT EDIT: Oh and get a PhD or whatever :P
I'm happy to dole out official blessings in return for an espresso :P Anyway, feel free to pop down to the office, we don't bite and you needn't be stranded.
My guess is also that this has to due with the peculiarities of H3K4. If you're really nervous, just put the samples through bowtie2 and see if you get similar results. multiBamSummary followed by plotCorrelation in deepTools should give telling results.
Hahah - ok, deal! :D
And I think you might be right - its something to do with the reads themselves being easier to map. I will come back to this post in a few days with a plot of "read #" vs "mapping time", with the dots coloured by histone modifications :)