Question

What is the target region size for total rnaseq?

0

Entering edit mode

6.4 years ago

Pin.Bioinf ▴ 350

Hello,

I want to calculate TMB (mutational load), i know it will not be accurate because its RNA-Seq, but what is the target region generally for RNA-Seq? Similar to WES (30Mb)? We sequenced with 100b length reads.

I need this to divide non synonymous mutations by this region size number.

Thank you

RNA-Seq • 2.1k views

ADD COMMENT • link 6.4 years ago by Pin.Bioinf ▴ 350

3

Entering edit mode

This totally depends on the number of reads and the tissue you're sequencing. What you could do is convert bam to bed, to get all regions with at least one read (or filter on a higher minimum number of reads). Then sum up the size of the intervals to get the size of the transcriptome. You can do all of this using bedtools.

ADD REPLY • link 6.4 years ago by WouterDeCoster 48k

0

Entering edit mode

Thank you! could you help me with the command? I am new to this, how do I keep all regions with at least one read and sum up the size of the intervals with bedtools? Here is my idea:

bedtools genomecov -ibam input.bam -bg to get the regions with at least one read, and then how can I calculate total size?

ADD REPLY • link 6.4 years ago by Pin.Bioinf ▴ 350

0

Entering edit mode

You should be able to figure this out using the bedtools manual, in this case you probably need bedtools genomecov.

ADD REPLY • link 6.4 years ago by WouterDeCoster 48k

0

Entering edit mode

I got this:

1891663623, so these are bp? Do these numbers seem reasonable?

What I did:

bedtools genomecov -ibam input.bam -bg | bedtools merge -i - > covered_36.bed 

cat covered_36.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

1891663623

ADD REPLY • link 6.4 years ago by Pin.Bioinf ▴ 350

1

Entering edit mode

This is more than half the human genome. You are essentially counting every base pair that received at least one read in RNA-seq, probably (most likely) containing a lot of transcriptional or experimental noise. I would intersect this bedGraph with a list of well-annotated gene coordinates (exons), e.g. from GENCODE, and focus on these high-confidence regions for your analysis. That should then narrow it down to a few percent of the total genome size.

ADD REPLY • link 6.4 years ago by ATpoint 88k

0

Entering edit mode

As I asked in a similar question that was closed (thought it was different enough to be asked separately), could I determine the transcriptome size by multiplying uniquely mapped reads by read size?

Here is an example of the report

                  Number of input reads |       57456000
              Average input read length |       150
                            UNIQUE READS:
           Uniquely mapped reads number |       47797002
                Uniquely mapped reads % |       83.19%
                  Average mapped length |       149.31
               Number of splices: Total |       3683773
    Number of splices: Annotated (sjdb) |       3610561
               Number of splices: GT/AG |       3573654
               Number of splices: GC/AG |       38516
               Number of splices: AT/AC |       4614
       Number of splices: Non-canonical |       66989
              Mismatch rate per base, % |       0.73%
                 Deletion rate per base |       0.07%
                Deletion average length |       2.06
                Insertion rate per base |       0.01%
               Insertion average length |       1.42

ADD REPLY • link 6.4 years ago by Pin.Bioinf ▴ 350

1

Entering edit mode

No, this naive analysis is not getting you anywhere:

It does not at all correct for contamination by rRNA and gDNA. Given you had 30% gDNA contamination, you would overestimate the transcriptome by 30%.
You would need to exclude any duplicates to avoid counting redundant information.
You would need an elaborate way to deal with multimappers which cannot be assigned to a unique location due to its repetitive nature.
You would not account for TLEN in paired-end sequencing, so all basepairs between the mate pairs that have not been sequenced. The real fragment length would need to be taken into account.

Scan the literature and find a well-established procedure for this rather than custom solutions.

ADD REPLY • link 6.4 years ago by ATpoint 88k

0

Entering edit mode

No that is total nonsense. Two 100bp reads mapping on the same location would get you a transcriptome size of 200 bp, while they're at the same location.

thought it was different enough to be asked separately

It was not.

ADD REPLY • link 6.4 years ago by WouterDeCoster 48k

0

Entering edit mode

I came up with this solution that a contributor in this page proposed to calculate the percentage of reference genome covered :

#Determine number of bases at 0 read depth
zero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4==0 {bpCountZero+=($3-$2)} {print bpCountZero}' | tail -1)

#Determine number of bases at >0 read depth, i.e., non-zero bases
nonzero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4>0 {bpCountNonZero+=($3-$2)} {print bpCountNonZero}' | tail -1)

#Calculate the percent of the reference genome coverd by >0 read depth bases
#Round up to 6 decimal places
percent=$(bc <<< "scale=6; ($nonzero / ($zero + $nonzero))*100")

echo $percent

Knowing the hg38 reference genome has a size of: 3,257,347,282 bp --> 3257 Mb, so calculating the percentage of this 3257 Mb I can obtain the size of my transcriptome to calculate TMB, am I correct?

ADD REPLY • link 6.4 years ago by Pin.Bioinf ▴ 350

0

Entering edit mode

I think that's my code? It was for DNA-seq though

ADD REPLY • link 6.4 years ago by Kevin Blighe 89k

0

Entering edit mode

am I correct

No.

No, this naive analysis is not getting you anywhere:

It does not at all correct for contamination by rRNA and gDNA. Given you had 30% gDNA contamination, you would overestimate the transcriptome by 30%.

You would need to exclude any duplicates to avoid counting redundant information.

You would need an elaborate way to deal with multimappers which cannot be assigned to a unique location due to its repetitive nature.

You would not account for TLEN in paired-end sequencing, so all basepairs between the mate pairs that have not been sequenced. The real fragment length would need to be taken into account.

Scan the literature and find a well-established procedure for this rather than custom solutions.

Scan the literature and find a well-established procedure for this rather than custom solutions.

ADD REPLY • link 6.4 years ago by ATpoint 88k