sliproach · 7.8 years ago
Hello,
I have an issue that I cannot resolve, as I am new to this type of analysis.
The NGS DNA samples I have were sequenced to different depths. Is there any way to equalise the sequencing depth across all samples at the level of the .bam or .bed files, to a particular total number of reads, e.g. 8 000 000?
Thank you.
When sequencing, you are sampling from a library, so what you see is representative of what is present in that library. You would need to selectively throw data away to achieve what you want. What is the use case that justifies doing this?
BTW: are you referring to depth = 8 000 000 as the total number of reads in the file, or as the coverage at any given base position?

I am referring to 8 000 000 as the total number of reads. The problem is that the samples come from three different batches, and one of them has much lower sequencing depth than the others. On a PCA plot of components 1 and 2, this batch is clearly separated from the others. I have tried different approaches to deal with this, and my supervisor suggested reducing the total number of reads. I am open to ideas.
You can use reformat.sh from the BBMap suite to do this. Something like:

reformat.sh in=your.bam out=sampled.bam srt=8000000

You may want to set sampleseed=N (a positive number) for deterministic sampling.

Edit: You could also sample the reads up-front, using something like

reformat.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=samp_R1.fq.gz out2=samp_R2.fq.gz srt=8000000

if that is more suitable, though sampling raw reads would not guarantee exactly 8 million alignments in the output files.

What did you cluster in the PCA?
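As a follow-up to the reformat.sh suggestion above, here is a minimal shell sketch (not part of the original answer) showing how it could be applied to a whole batch at once; the *.bam file layout, output naming, and seed value are assumptions:

# Downsample every BAM in the current directory to ~8 million reads;
# a fixed sampleseed makes the sampling reproducible across reruns.
for bam in *.bam; do
    reformat.sh in="$bam" out="sampled_${bam}" srt=8000000 sampleseed=42
done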
Please use ADD REPLY on the correct post, so that the thread remains logically structured and easy to follow.

PCA is an unsupervised method, so in this instance it's just the two principal components that explain the largest amount of variability in the samples?
Yes, but what's the input? Variants, read counts,...?
Sorry, my bad, a misinterpretation on my end.
Remove them from the analysis. Alternatively, you can downsample the more deeply sequenced libraries to a level consistent with the problematic batch. If you still see clear separation on the PCA after doing this, then it is likely a batch-effect issue and the samples should be discarded.
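A rough sketch of the "downsample to the level of the shallowest library" idea, using samtools instead of reformat.sh (my substitution, not the commenter's exact workflow); file names are hypothetical, and samtools view -s samples reads randomly, so the resulting counts are approximate:

# Smallest mapped-read count across all samples
min=$(for bam in *.bam; do samtools view -c -F 4 "$bam"; done | sort -n | head -n1)

for bam in *.bam; do
    n=$(samtools view -c -F 4 "$bam")        # mapped reads in this sample
    if [ "$n" -le "$min" ]; then
        cp "$bam" "downsampled_${bam}"       # already the shallowest library
    else
        # -s SEED.FRACTION: integer part is the random seed, decimals are the keep rate
        sf=$(awk -v m="$min" -v n="$n" 'BEGIN{printf "42.%04d", int(m/n*10000)}')
        samtools view -b -s "$sf" "$bam" > "downsampled_${bam}"
    fi
done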
You could even look into scale and quantile normalization. This is commonly used in meta-analyses of chip data from different platforms or signal channels, though it feels a bit like cheating for RNA-seq.
The OP wrote "NGS DNA samples", so it doesn't sound like RNA-seq, although it's unclear what exactly is being sequenced.

Ah, I missed that. My bad.
Thank you for the suggestion. I will try it out.
Please use ADD COMMENT/ADD REPLY when responding to existing posts, to keep threads logically organized.

I asked Istvan to make this reminder a standard moderation option, and he will add it in the next update of the website. I get the feeling about half of our posts are this sentence ;)
Do you mean that you want to normalize as if you had 8 000 000 reads mapped? If so, you can use the --scaleFactor option from deepTools to normalize with a factor of 8000000/number_of_reads_mapped. You can calculate the number of mapped reads by running samtools flagstat on the .bam file.
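A minimal sketch of that suggestion, assuming deepTools bamCoverage is the command taking --scaleFactor and that a scaled bigWig coverage track is the desired output; the file names are hypothetical:

# sample.bam must be coordinate-sorted and indexed (samtools index sample.bam)
mapped=$(samtools view -c -F 4 sample.bam)                           # same count flagstat reports as "mapped"
sf=$(awk -v t=8000000 -v n="$mapped" 'BEGIN{printf "%.6f", t/n}')    # 8 000 000 / number of mapped reads
bamCoverage -b sample.bam -o sample_scaled.bw --scaleFactor "$sf"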