The sequencing of this ChIP-seq H3K4me3 experiment (human, single-end 50bp) has a problem with the number of reads per sample. The person that did the sequencing did not arrange correctly the samples in the lanes, causing the bias you see in the picture (first, upper one). I am trying to "rescue" it doing downsampling, but wanted to make sure I am doing it correctly without loosing too much coverage that I loose robustness in whatever finding I have later on.
I did the downsampling as described in Ming Tang's blog : https://crazyhottommy.blogspot.com/2016/05/downsampling-for-bam-files-to-certain.html . Here there is an example line of code:
samtools view -s 3.6 -b my.bam -o subsample.bam
I did three downsamplings to see what happens: one to 20 M reads, one to 25M and one to 10M. In the figure below you only see raw data, 25M and 10M in that order (from top to bottom). My questions are:
1) How much you think I can downsample without having an unreasonable coverage? Is 10M ok or given that is single-end more should be used? (for example, in Girdhar et al 2018's schizophrenia chip-seq paper in humans they used 13M but it was paired-end sequencing).
2) If I end up deciding 10M is too little and I go with the 25 or 20M one (please, look at the second plot), you can still see a difference between the samples. Is that difference small enough to continue analysing or you think that's already too much?
3) In case in 2) you consider this difference is too big and that that's not good, you have another suggestion of what I could do?
Thank!
Thank you ! It's good to know (point 1)). Yeah, I will do what you suggest in 2). Answering your question, these numbers are after samtools markdup and filtering out those alignments with MAPQ below 30 (so keeping uniquely mapped reads only). These read numbers you see in the plot I took them from checking those BAMs using samtools flagstat, and then I plotted them in a bar chart.