Question

interpreting GC-bias compute and correct results from DeepTools

1

Entering edit mode

4.8 years ago

Anand Rao ▴ 640

Greetings!

I want to check whether my BAM file suffers from GC-bias, especially because this is a relatively very old sample, with the following details:

Tissue Extraction Kit = GeneAll HybridR+ kit (GeneAll)
RNA Extraction Kit = TruSeq RNA sample preparation kit
Platform = Illumina HiSeq 2000
Sequencing Kit = TruSeq SBS kit v3-HS

Based on several posts here on BioStars and elsewhere online, I chose to use

ComputeGCBias from deepTools, and
CorrectGCBias, also from deepTools.

I've generated plots for 4 different combos:

before vs after correcting for GC-bias,
using either the full unaltered genome assembly or the repeat masked genome assembly.

I seek your help with making sure my interpretation of these results are correct, so here is a composite image of the plots.

If you can confirm / refute my statements, and answer my questions, with any relevant links and commentary, I'd be most obliged. Thanks in advance!

enter image description here

There is GC-bias in my input sample (red font marked plots), as seen in the bottom sub-panels, whether it be using the full genome or the repeat-masked genome.
GC-bias correction works as expected in terms of smoothing log2(observed/expected) Vs GC% in the corrected output (green font plots)
Post-GCbias correction, it looks like for full genome case, reads in the >60% GC-range have been down-sampled.
I suppose it does not make sense to examine reads against the masked version of the genome because of the possibility that expression can truly be from repeat regions, and other reason(s) ?
Should I use GC-bias corrected BAM file with featureCounts to report gene expression as count data?
OR Is this step ONLY to examine and acknowledge if there is GC-bias, but use other preferred methods to compensate for GC-bias?
If latter, then which bioinformatic protocol(s) to follow for factoring in GC-bias at a later stage in my analyses?

GC-bias deepTools • 4.5k views

ADD COMMENT • link 4.8 years ago by Anand Rao ▴ 640

0

Entering edit mode

4.8 years ago

Anand Rao ▴ 640

Please also see response to a variant of this question at BioConductor from Michael Love at https://support.bioconductor.org/p/132192/

ADD COMMENT • link 4.8 years ago by Anand Rao ▴ 640

ATpoint · Accepted Answer · 2020-06-28

5

Entering edit mode

4.8 years ago

ATpoint 87k

The deeptools implementation for GC bias is the one from Benjamini & Speed and intended for DNA-seq, therefore I think not applicable here. Instead I suggest you check the Bioconductor package alpine for GC bias in RNA-seq or (my recommendation) use salmon to quantif your RNA-seq data against the reference transcriptome using the --gcBias flag which ~~implements the principle of that package~~ implements a GC bias correction method directly into the quantification process. That would correct a potential GC bias without that you have to do anything beyond setting that flag.

The idea behind alpine is explained here by the author Mike Love:

Edit: I actually thought Alpine and the GC bias impementation in salmon was identical, apparently that was wrong, see https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/ for a comment from Mike Love towards the differences between Alpine and the Salmon implementation. In any case, don't use deeptools, but a dedicated RNA-seq method.

Edit2: Rob Patro (salmon author) confirmed in our slack that the salmon GC bias method is fine with either stranded and unstranded data, but paired-end data are currently required.

ADD COMMENT • link 4.8 years ago by ATpoint 87k

0

Entering edit mode

Indeed, I had not realized that Benjamini & Speed algorithm applies only to DNA-Seq. Thanks for pointing that out and for your recommendation for salmon using the --gcBias flag.

So this means computeGCBias itself is not valid for RNA-Seq data, let alone correctGCBias - am I right?

Edit: Salmon requires paired-end data.

ADD REPLY • link updated 4.8 years ago by ATpoint 87k • written 4.8 years ago by Anand Rao ▴ 640

0

Entering edit mode

I guess not, though I am not at all an expert in that matter. The DNA-seq method (from what I take from the video) uses a sliding window approach (probably across the genome), but as in RNA-seq we have coverage mostly in exons, few or absent in introns and essentially no coverage in intergenic regions so this probably messes up the assumptions of that tool. I guess currently salmon is the go-to for this task in RNA-seq. In any case the salmon selective alignment approach typically outperforms traditional alignment for RNA-seq quantification, check the recent benchmarking papers towards that.

ADD REPLY • link 4.8 years ago by ATpoint 87k

0

Entering edit mode

From this ALPINE link, under Description:

It is currently designed for un-stranded pairedend RNA-seq data.

My data is un-stranded BUT single end RNA-Seq data - so it looks like I cannot use ALPINE? Is it because using SE data would violate the underlying (statistical) assumptions of PE data, just curious...

About SALMON compatibilty with SE vs PE Illumina data, I found this useful excerpt from SALMON README online,

If you are using single-end reads, then you pass them to Salmon with the -r flag like:

./bin/salmon quant -i transcripts_index -l <LIBTYPE> -r reads.fq --validateMappings -o transcripts_quant

So using SALMON should not be a problem for my SE data.

However, for visualization and verification whether GC-bias exists in SE RNA-Seq data, can I use SALMON instead of ALPINE? Or is bias visualization possible only via ALPINE? But since it won't work for my SE data, what tool(s) could I use to visualize bias in my SE RNA-Seq data? Any advice?

PS. Just emailed Prof. Michael Love, his inputs should be decisive. He may answer here, or I'll share contents of his reply if I receive one.

ADD REPLY • link 4.8 years ago by Anand Rao ▴ 640

1

Entering edit mode

I am not aware that it was for unstranded data, the salmon manual does not mention this explicitely but I pinged the developer Rob Patro in our Slack, lets see what he says. Yes PE is necessary. By the way, please consider not to email developers (as many of them explicitely request) and rather open a question at support.bioconductor.org where many of the Bioc developers including ML are very active and helpful. It is also not standard to correct GC bias, many pipelines do not do this, and if you do not have PE data then why not considering to simply not do it?

ADD REPLY • link 4.8 years ago by ATpoint 87k

0

Entering edit mode

Thanks for that suggestion, I'd never posted there before, but I just did. Hopefully it's not flagged for cross-posting.

You said

Edit2: Rob Patro (salmon author) confirmed in our slack that the salmon GC bias method is fine with either stranded and unstranded data, but paired-end data are currently required.

I'm really confused now, because at this SALMON link, it clearly says otherwise:

Quantifying in mapping-based mode Then, you can quantify any set of reads (say, paired-end reads in files reads1.fq and reads2.fq) directly against this index using the Salmon quant command as follows:

./bin/salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fq -2 reads2.fq --validateMappings -o transcripts_quant

If you are using single-end reads, then you pass them to Salmon with the -r flag like:

./bin/salmon quant -i transcripts_index -l <LIBTYPE> -r reads.fq --validateMappings -o transcripts_quant

Could you please clarify with author Rob Patro? Sorry for the bother...

ADD REPLY • link 4.8 years ago by Anand Rao ▴ 640

1

Entering edit mode

For salmon quantification in general: single- and paired-end data are ok.

For GC-bias correction with salmon: only paired-end data are supported, but it does not matter whether data are stranded or not.

ADD REPLY • link 4.8 years ago by ATpoint 87k

1

Entering edit mode

OK, I see what you mean - SE OK for SALMON based quantification, but not for correction. Are there any tools that can visualize and/or correct GC-bias in SE data? May be not, which is probably why you said earlier -

if you do not have PE data then why not considering to simply not do it?

ADD REPLY • link 4.8 years ago by Anand Rao ▴ 640

score 4 · Accepted Answer · 2020-06-28

4

Entering edit mode

4.8 years ago

Devon Ryan 105k

Since I'm a deepTools author, I'll echo ATpoint and suggest not using computeGCBias for RNA-seq data. This is intended ONLY for old ChIP-seq and similar data and shouldn't be used in any other context.