BULK mRNA-seq with UMIs. Does it make sense to run bedtools genomecov?
0
0
Entering edit mode
22 months ago
txema.heredia ▴ 200

Hi,

I am using BULK rna-seq which includes UMI added after cDNA fragmentation, before PCR amplification (see my previous post BULK mRNA-seq with UMIs. Do I need to normalize by gene length? ).

I am trying to analyze the effects of a drug which might affect genes differently depending on their gene length, so I am trying to correct as much as possible the effects of gene length bias.

(I am totally aware of the existence of transcript quantification methods like salmon and kallisto. I am already using/planning to use them in parallel)

My input is: UMI-deduplicated splicing-aware mapped bam files. This should have a number of counts per gene proportional to real mRNA-molecule count gene/transcript length. After UMI-deduplication the gene-lenght bias caused by PCR amplification should* be gone. However, there still exists a gene-length bias driven by the number of fragments in longer genes.

Would it make sense to do the following:

  1. bedtools genomecov -bg .... on each bam file. This should give me the number of UMI-deduplicated reads that overlap in the sequence. Intuitively, if 2 or more (UMI-deduplicated) reads overlap in the same position (for more than X bp), that would suggest that these 2 reads originated from fragments from 2 different mRNA molecules. (Because a single molecule's fragments shouldn't overlap, right?)
  1. Get the highest number of overlapping reads/max coverage of each gene. Using a combination of bedtools intersect ... -wa -wb of the previous result with a .gtf annotation file, followed by bedtools merge ... -o max. This would give me, the highest value of the calculated coverage in each gene --> maximum overlap of reads in a single position within the gene --> X fragments originating from X actual real mRNA molecules detected in that gene.

I understand that this approach will underestimate the actual expression. You could have 10 molecules but happen to sequence only like 7 fragments overlapping in a given position.

Does this sound stupid? Do you have any idea on how to improve it?

PS: Yeah, I am planning to compare these results vs other quantification metrics like whole-gene htseq-count, htseq-count divided by gene length, kallisto quantification aggregating per gene.

RNAseq gene-length normalization • 567 views
ADD COMMENT

Login before adding your answer.

Traffic: 2815 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6