After a standard RNA-seq analysis : STAR -> FeatureCounts -> Deseq2, I want to visualize my reads coverage with IGV. Nevertheless, BAM files are not normalized, so visualisation is biased. Indeed, I used the size factor from Deseq2 to normalize my BAM files (https://support.bioconductor.org/p/76712/).
Question : is it acceptable to apply size factor calculated on counted read to BAM files which are aligned reads ?
How I did the job :
# DESeq2 size factor
WT_R=0.8560343;
WT_S=1.1681774;
# Change value for bamCoverage option (counts need to be divided by size factor)
WT_R_div=$(awk -v WT_R=$WT_R 'BEGIN { print (1 / WT_R) }');
WT_S_div=$(awk -v WT_S=$WT_S 'BEGIN { print (1 / WT_S) }');
# Apply size factor to BAMs / BIGWIG with bamCoverage "--scaleFactor" (and assign reads to forward or reverse strand)
for f in WT*.bam;
do scale=$(eval printf "$"${f%.*}"_div");
bamCoverage --bam $f \
--outFileName ${f%.*}"_forward_strand.bigwig" \
--outFileFormat bigwig \
--scaleFactor $scale \
--filterRNAstrand forward \
--binSize 1 \
--verbose;
bamCoverage --bam $f \
--outFileName ${f%.*}"_reverse_strand.bigwig" \
--outFileFormat bigwig \
--scaleFactor $scale \
--filterRNAstrand reverse \
--binSize 1 \
--verbose;
done &>> bamCoverage.verbose;
DESeq2 divides, an easy way in bash to convert the factor to feed into bamCoverage would be, given that the factor is saved as variable
$FACTOR
:Thanks Devon, but that's bothering me to apply size factor (generated with counted reads) on aligned reads. The ratio aligned reads/counted reads is not always the same in my samples. Say differently, I don't lose the same amount of reads in my samples from aligned reads to counted reads. If Deseq2 would've created size factor using aligned reads I will be OK to use it on aligned reads, but it's not.
If you want to be totally clean about that, you could remove those reads from the BAM file that are not being taken into consideration for the read count matrix. Not sure if featureCounts can tell you the read IDs that it's using, but since it comes with an aligner, maybe there is a way to retrieve that information.
Irrespective of that, I think as long as you remove the presumably biggest chunk of reads you're ignoring (i.e., those not overlapping with exons, perhaps duplicates), you're already playing it fairly safe. After all, I thought you mainly want to do it for visualization purposes, not for statistical calculations, right?
The counted reads are just those aligned with a minimal MAPQ to remove obvious amounts of multimapping. You can apply a reasonable MAPQ threshold in deepTools (e.g.,
--minMappingQuality 4
would work for STAR and match what featureCounts is doing by default) and then the reads used are mostly the same (the main difference being those not overlapping genes).I'm trying to do something similar. I figured you may be able to answer Devon. Will the
--scaleFactor
option accept fractions? As in--scaleFactor 1000000/20000000
.One other question. The documentation for
--normalizeUsing
says this:Does this mean it's not possible to output raw counts per bin?
You can't use expressions for the scaling factor, you'll need to put the resulting value in.
I need to update the documentation regarding what
None
does. It will produce raw coverage metrics with no adjustment by default, which is what it sounds like you want.