Entering edit mode
5.9 years ago
Pin.Bioinf
▴
340
Hello,
I want to calculate TMB (mutational load), i know it will not be accurate because its RNA-Seq, but what is the target region generally for RNA-Seq? Similar to WES (30Mb)? We sequenced with 100b length reads.
I need this to divide non synonymous mutations by this region size number.
Thank you
This totally depends on the number of reads and the tissue you're sequencing. What you could do is convert bam to bed, to get all regions with at least one read (or filter on a higher minimum number of reads). Then sum up the size of the intervals to get the size of the transcriptome. You can do all of this using bedtools.
Thank you! could you help me with the command? I am new to this, how do I keep all regions with at least one read and sum up the size of the intervals with bedtools? Here is my idea:
bedtools genomecov -ibam input.bam -bg to get the regions with at least one read, and then how can I calculate total size?
You should be able to figure this out using the bedtools manual, in this case you probably need bedtools genomecov.
I got this:
1891663623, so these are bp? Do these numbers seem reasonable?
What I did:
This is more than half the human genome. You are essentially counting every base pair that received at least one read in RNA-seq, probably (most likely) containing a lot of transcriptional or experimental noise. I would intersect this bedGraph with a list of well-annotated gene coordinates (exons), e.g. from GENCODE, and focus on these high-confidence regions for your analysis. That should then narrow it down to a few percent of the total genome size.
As I asked in a similar question that was closed (thought it was different enough to be asked separately), could I determine the transcriptome size by multiplying uniquely mapped reads by read size?
Here is an example of the report
No, this naive analysis is not getting you anywhere:
Scan the literature and find a well-established procedure for this rather than custom solutions.
No that is total nonsense. Two 100bp reads mapping on the same location would get you a transcriptome size of 200 bp, while they're at the same location.
It was not.
I came up with this solution that a contributor in this page proposed to calculate the percentage of reference genome covered :
Knowing the hg38 reference genome has a size of: 3,257,347,282 bp --> 3257 Mb, so calculating the percentage of this 3257 Mb I can obtain the size of my transcriptome to calculate TMB, am I correct?
I think that's my code? It was for DNA-seq though
No.