If I have more than 40% are marked duplicate in my RNASeq and discarded by MuTect2 and another 40% not primary alignments.
This is how the duplication looks from fastqc:
https://ibb.co/hEPoJH
The figure says 43% seq will remain if deduplicated.
I make a coverage profile and it is here, the genome hg38 assembly is covered only by ~17%.
https://ibb.co/d3pr5x
This is a sample of the distribution I draw its curve:
All these bases (2644954237) are covered by zero reads !
I mean I count how many reads cover each base in the reference.
I feel this is very low coverage and very high duplication, is this normal with RNASeq or can be explained?
If data is from a patterned FC then you should start looking to see if you can discern if the duplication is optical or due to excessive amplification. You can use clumpify.sh from BBMap as noted in this post for that: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files
genome hg38 assembly is covered only by ~17%
I don't remember what part of genome is coding but since this RNAseq that number may be appropriate.
17% coverage actually seems a bit high, though I guess I've never actually checked this. Keep in mind that protein-coding genes only occupy ~2% of the genome and they're most of what you're sequencing (presuming you're doing standard poly-A enrichment). 40% non-primary alignments is high for RNAseq data, at least for what I deal with. A high duplication rate like you're seeing is expected. After all, you're going to have VERY high coverage of highly expressed genes and many of those reads will be falsely labeled as PCR duplicates. For what it's worth, this is why one doesn't mark duplicates in RNAseq data.
Be careful when doing variant calling from RNAseq data. Variant callers aren't typically written to consider things like allele-specific expression, >1000x differences in expected coverage, or RNA editing.
Thanks Devon so much. Appreciated. Do you mean I should not remove duplicates in variant calling RNASeq? Also Do you have any recommendation on how "careful", I mean I am following GATK pipleine for somatic mutations, do u recommend something else? I am very new to this. Thanks
While you should not remove PCR duplicates if this is data from HiSeq 3000/4000 or NovaSeq you should check for optical duplicates. Those may need to be removed, if you have a lot of them.
Have you read this post from author of FastQC?
If data is from a patterned FC then you should start looking to see if you can discern if the duplication is optical or due to excessive amplification. You can use clumpify.sh from BBMap as noted in this post for that: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files
I don't remember what part of genome is coding but since this RNAseq that number may be appropriate.
No, I did not, thanks for sharing with me, will try BBMap now. Thanks a lot GenoMax.
Remember to use original raw data when you try to estimate duplication sources. Trimmed/otherwise manipulated will throw off your analysis.
Ok, thanks. You also mean 17% coverage only in RNASeq is normal?
Can you confirm that by your observation 17% of the whole genome (out of ~3 gb) is covered by at least one read in this dataset?
Yes, 17% of the bases of the genome is covered by at least one read.