Question

High Duplication and low coverage RNASeq?

0

Entering edit mode

6.8 years ago

Sharon ▴ 610

Hi All

If I have more than 40% are marked duplicate in my RNASeq and discarded by MuTect2 and another 40% not primary alignments.
This is how the duplication looks from fastqc: https://ibb.co/hEPoJH The figure says 43% seq will remain if deduplicated.

I make a coverage profile and it is here, the genome hg38 assembly is covered only by ~17%. https://ibb.co/d3pr5x This is a sample of the distribution I draw its curve:

0 2644954237
1 86037238
2 173531033
3 38889212
4 77790438
5 23056365
6 37055662
7 14282143
8 19123921
9 9226812
10 10713451

All these bases (2644954237) are covered by zero reads ! I mean I count how many reads cover each base in the reference. I feel this is very low coverage and very high duplication, is this normal with RNASeq or can be explained?

RNA-Seq Duplication Reads Low Coverage • 2.0k views

ADD COMMENT • link updated 6.8 years ago by Devon Ryan 104k • written 6.8 years ago by Sharon ▴ 610

0

Entering edit mode

Have you read this post from author of FastQC?

If data is from a patterned FC then you should start looking to see if you can discern if the duplication is optical or due to excessive amplification. You can use clumpify.sh from BBMap as noted in this post for that: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

genome hg38 assembly is covered only by ~17%

I don't remember what part of genome is coding but since this RNAseq that number may be appropriate.

ADD REPLY • link 6.8 years ago by GenoMax 147k

0

Entering edit mode

No, I did not, thanks for sharing with me, will try BBMap now. Thanks a lot GenoMax.

ADD REPLY • link 6.8 years ago by Sharon ▴ 610

0

Entering edit mode

Remember to use original raw data when you try to estimate duplication sources. Trimmed/otherwise manipulated will throw off your analysis.

ADD REPLY • link 6.8 years ago by GenoMax 147k

0

Entering edit mode

Ok, thanks. You also mean 17% coverage only in RNASeq is normal?

ADD REPLY • link 6.8 years ago by Sharon ▴ 610

0

Entering edit mode

Can you confirm that by your observation 17% of the whole genome (out of ~3 gb) is covered by at least one read in this dataset?

ADD REPLY • link 6.8 years ago by GenoMax 147k

0

Entering edit mode

Yes, 17% of the bases of the genome is covered by at least one read.

ADD REPLY • link 6.8 years ago by Sharon ▴ 610

score 2 · Accepted Answer · 2018-02-24

2

Entering edit mode

6.8 years ago

Devon Ryan 104k

17% coverage actually seems a bit high, though I guess I've never actually checked this. Keep in mind that protein-coding genes only occupy ~2% of the genome and they're most of what you're sequencing (presuming you're doing standard poly-A enrichment). 40% non-primary alignments is high for RNAseq data, at least for what I deal with. A high duplication rate like you're seeing is expected. After all, you're going to have VERY high coverage of highly expressed genes and many of those reads will be falsely labeled as PCR duplicates. For what it's worth, this is why one doesn't mark duplicates in RNAseq data.

Be careful when doing variant calling from RNAseq data. Variant callers aren't typically written to consider things like allele-specific expression, >1000x differences in expected coverage, or RNA editing.

ADD COMMENT • link 6.8 years ago by Devon Ryan 104k

1

Entering edit mode

1-2% coding genes and 8-9% regulatory content (not sure if all of this would be expressed) seems to be the recent estimate I could find.

ADD REPLY • link 6.8 years ago by GenoMax 147k

0

Entering edit mode

Thanks GenoMax so much. I think it becomes more clear now :)

ADD REPLY • link 6.8 years ago by Sharon ▴ 610

0

Entering edit mode

Thanks Devon so much. Appreciated. Do you mean I should not remove duplicates in variant calling RNASeq? Also Do you have any recommendation on how "careful", I mean I am following GATK pipleine for somatic mutations, do u recommend something else? I am very new to this. Thanks

ADD REPLY • link 6.8 years ago by Sharon ▴ 610

1

Entering edit mode

I think GATK had a beta RNAseq variant pipeline for a while, have a look at that.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

1

Entering edit mode

While you should not remove PCR duplicates if this is data from HiSeq 3000/4000 or NovaSeq you should check for optical duplicates. Those may need to be removed, if you have a lot of them.