NGS Sequencing depths
0
1
Entering edit mode
7.8 years ago
sliproach ▴ 10

Hello,

I have an issue that I cannot resolve, as I am new to this type of analysis.

The NGS DNA samples I have were sequenced to different depths. Is there any way to equalise the sequencing depths across samples at the level of .bam or .bed files, down to a particular number of reads, e.g. 8 000 000?

Thank you.

next-gen sequencing gene • 1.9k views
ADD COMMENT
0
Entering edit mode

When sequencing, you are sampling from a library, so what you see is representative of what is present in that library. You would need to selectively throw data away to achieve what you want. What is the use case that justifies doing this?

BTW: by depth = 8 000 000, do you mean the total number of reads in the file, or the number of reads covering any given base position?

ADD REPLY
0
Entering edit mode

I am referring to 8 000 000 as the total number of reads. The problem is that the samples come from three different batches, and one of them has much lower sequencing depths than the others. On a PCA plot of components 1 and 2, this batch is clearly separated from the other ones. I have tried different approaches to correct for this, and my supervisor suggested reducing the total number of reads. I am open to ideas.

ADD REPLY
3
Entering edit mode

You can use reformat.sh from the BBMap suite to do this. Something like reformat.sh in=your.bam out=sampled.bam samplereadstarget=8000000. You may also want to set sampleseed=N (a positive number) for deterministic sampling.

Edit: You could also sample reads up front (using something like reformat.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=samp_R1.fq.gz out2=samp_R2.fq.gz samplereadstarget=8000000), if that is more suitable. Note that this would not guarantee exactly 8 million alignments in the output files, since not every sampled read will map.

ADD REPLY
0
Entering edit mode

What did you cluster in the PCA?

Please use ADD REPLY on the comment you are reacting to, so that the thread remains logically structured and easy to follow.

ADD REPLY
0
Entering edit mode

PCA is an unsupervised method, so in this instance it's just the two principal components that explain the largest amount of variability across the samples?

ADD REPLY
0
Entering edit mode

Yes, but what's the input? Variants, read counts,...?

ADD REPLY
0
Entering edit mode

sorry, my bad, a misinterpretation on my end.

ADD REPLY
0
Entering edit mode

Remove them from the analysis. Alternatively, you can downsample the more deeply sequenced libraries to a level consistent with the problematic batch. If you still see clear separation on the PCA after doing this, then it is likely a batch effect, and those samples should be discarded.
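The downsampling suggested above can be sketched with samtools view -s, whose argument is SEED.FRACTION. This is only a sketch: the read counts below are hypothetical placeholders; in practice you would take the total from samtools flagstat on your .bam file.

```shell
# Downsample a deep library to ~8 million reads (sketch; counts are hypothetical).
total=25000000   # mapped reads in the deep library (from samtools flagstat)
target=8000000   # desired depth, matching the shallow batch
# Fraction of reads to keep:
frac=$(awk -v t="$total" -v g="$target" 'BEGIN{printf "%.4f", g/t}')
echo "$frac"     # prints 0.3200
# samtools view -s takes SEED.FRACTION; seed 42 here for reproducibility.
# (Skipped if samtools is not installed.)
command -v samtools >/dev/null 2>&1 && \
  samtools view -b -s "42${frac#0}" deep.bam > downsampled.bam || true
```

Because -s keeps each read with a fixed probability, the output will have approximately, not exactly, the target number of reads.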

ADD REPLY
0
Entering edit mode

You could also look into scale and quantile normalization. This is commonly used in meta-analyses of chip data from different platforms or signal channels, though it feels a bit like cheating for RNA-seq.

ADD REPLY
0
Entering edit mode

OP wrote:

The NGS sequencing DNA samples

so it doesn't sound like RNA-seq, although it's unclear what they are sequencing.

ADD REPLY
0
Entering edit mode

Ah, I missed that. My bad.

ADD REPLY
0
Entering edit mode

Thank you for the suggestion. I will try it out.

ADD REPLY
0
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY
1
Entering edit mode

I asked Istvan to add this reply as a standard moderation option, and he will do that in the next update of the website. I get the feeling about half of our posts are this sentence ;)

ADD REPLY
0
Entering edit mode

Do you mean that you want to normalize as if you had 8 000 000 reads mapped? If so, you can use the --scaleFactor option from deepTools with a factor of 8000000/number_of_reads_mapped. You can calculate the number of mapped reads by running samtools flagstat on the .bam file.
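As a sketch of that normalization, assuming deepTools' bamCoverage is available (the mapped-read count below is a made-up placeholder; in practice parse it from samtools flagstat):

```shell
# Compute a scale factor so coverage is normalized to 8 000 000 mapped reads.
mapped=20000000   # hypothetical mapped-read count from samtools flagstat
scale=$(awk -v m="$mapped" 'BEGIN{printf "%.4f", 8000000/m}')
echo "$scale"     # prints 0.4000
# Apply it with deepTools bamCoverage (skipped if deepTools is not installed):
command -v bamCoverage >/dev/null 2>&1 && \
  bamCoverage -b your.bam -o scaled.bw --scaleFactor "$scale" || true
```

Unlike downsampling, this rescales the coverage signal rather than discarding reads, so per-read statistics (e.g. duplicate rates) are unaffected.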

ADD REPLY
