8.3 years ago
camelbbs
Hi,
We did RNA sequencing on 6 samples and found that the total number of reads differed considerably between them. For example:
sampleA1 sampleA2 sampleA3 sampleB1 sampleB2 sampleB3
150000 160000 180000 250000 260000 250000
Do we need to downsample every sample to the same number of reads before further analysis? For example:
sampleA1 sampleA2 sampleA3 sampleB1 sampleB2 sampleB3
150000 150000 150000 150000 150000 150000
This is our sum of RPKM for each sample:
sample s-655605 s-664561 s-665905 ZC1 ZC2 ZC3
total_rpkm 2336029.676 1846496.591 2262622.929 554911.8613 774240.5722 636009.5591
The sums differ a lot between samples: the sum in the S group is roughly 3 to 4 times that in the ZC group. Does anyone know the reason? We used total RNA and an rRNA-depleted library.
Thanks. Cam
You should not correct them manually. If this is for differential expression analysis, the DE tools will take care of it; this step is called normalisation.
Why is the total number of reads so low?
Thanks, I just wrote those numbers as an example. I do know about normalization procedures such as the one in DESeq. I am asking because we found that the sum of RPKM differs a lot between samples. We speculate that the reason is a difference in sequencing depth.
Two samples sequenced independently will never have the same depth. That's why we have to normalise for library size (total reads sequenced). But RPKM already normalises for sequencing depth.
I modified my question; could you take a look again? Thanks.
First question: No, do not alter your samples by removing reads. Read counts are not a linear measure of abundance. For example, highly expressed genes tend to be over-represented in the reads, and lowly expressed genes under-represented. My advice is to use normalized values. Do not use RPKM for this (many detailed publications explain why); use specialised packages such as DESeq or edgeR, which handle your samples and their differences better.
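To give an idea of what such packages do instead of discarding reads, here is a minimal sketch of a median-of-ratios size factor calculation, the general idea behind DESeq-style normalisation. It is a simplified illustration, not the package's actual implementation; the function name `size_factors` and the toy counts are made up for this example.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (simplified sketch).

    counts: dict mapping sample name -> list of per-gene read counts,
            with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples.
    # Genes with a zero count in any sample are skipped (None).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor = median ratio of a sample's counts to the geometric means.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Hypothetical toy counts: sample B was sequenced twice as deeply as A.
toy_counts = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = size_factors(toy_counts)

# Dividing each sample's counts by its factor puts them on a common scale.
normalized = {s: [c / factors[s] for c in toy_counts[s]] for s in toy_counts}
```

All reads are kept; only a per-sample scaling factor is estimated, so no information is thrown away as it would be with downsampling.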
Second question: RPKM divides the number of reads mapped to each gene by the length of that gene (in kb), after scaling by the total number of mapped reads (in millions). For example, if a sample has many short genes (under 1 kb) that are expressed, their read counts are divided by a length < 1, which inflates their RPKM and may explain the differences in the sums.
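A small sketch of why the sum of RPKM need not match between samples, even at identical sequencing depth. The gene names, lengths and counts are hypothetical and chosen only to illustrate the effect of gene length; `rpkm` is a helper written for this example, not a function from any package.

```python
def rpkm(counts, lengths_bp):
    """RPKM per gene: reads / (gene length in kb) / (total mapped reads in millions)."""
    total_millions = sum(counts.values()) / 1e6
    return {
        gene: counts[gene] / (lengths_bp[gene] / 1e3) / total_millions
        for gene in counts
    }

# Hypothetical gene lengths in base pairs.
lengths_bp = {"short_gene": 500, "long_gene": 5000}

# Two hypothetical samples with the same total read count (1000),
# but with the reads distributed differently between the two genes.
sample_x = {"short_gene": 800, "long_gene": 200}
sample_y = {"short_gene": 200, "long_gene": 800}

sum_x = sum(rpkm(sample_x, lengths_bp).values())
sum_y = sum(rpkm(sample_y, lengths_bp).values())

# The sample whose reads fall mostly on the short gene has the larger sum.
print(sum_x, sum_y)
```

Because each gene's count is divided by its own length, the total depends on which genes the reads fall on, so a shift in composition between groups (for example, more reads on short transcripts in one group) changes the sum of RPKM even though depth is already accounted for.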