Question

Intersectbed/Coveragebed -Split Purify Exon?

2

Entering edit mode

12.8 years ago

Puriney ▴ 330

all.reads.bam file records mapped RNA-seq reads data, including:

exon:exon junction
exon body
intron body
exon:intron junction

Q1: When calculating RPKM for given RefSeq gene including all the position reads, will the following command just calculate exon:exon junction reads and at same time ignore all other reads?

coverageBED -abam all.reads.bam -b refseq.genes.BED12.bed -s -split >coverage.bed

I'm confused by the mannual (Page 62):

When dealing with RNA-seq reads, for example, one typically wants to only tabulate coverage for the portions of the reads that come from exons (and ignore the interstitial intron seqeunce), The -split command allows for such coverage to be performed.

If "-split" is set, the exon:exon read (for example, 30M3000N46M") exists in -abam bam file, and the 3000N will NOT be wrongly intersected when running intersectBED command. But what about coverageBED command? I do hope the 3000N will be not calculated which makes sense, and I also hope the intron body reads and other reads will be NOT ignored.

Q2: If one just want to calculate exon's RPKM, does it mean one should prepare -b file to record all the exon information, and run like this:

coverageBED -abam all.reads.bam -b all.exons.bed -s >coverage.bed

Q3: How to calculate RPKM for given genes whose reads overlap with exons? We all known BED12 format file record the exon information (start and length) for given RefSeq gene. Could BEDtools do the magic things? Note this calculation is different from Q1.

Thank you, and looking forward to your replies.

bedtools coverage intersect • 7.7k views

ADD COMMENT • link updated 12.8 years ago by johannes.helmuth ▴ 110 • written 12.8 years ago by Puriney ▴ 330

score 1 · Answer 1 · 2012-09-27

I am confused because you talk about calculating RPKM but, with coverageBed, you do not calculate a RPKM value. Taken from the documentation, the program gives you information on

1) The number of features in A that overlapped the B interval.
2) The number of bases in B that had non-zero coverage.
3) The length of the entry in B.
4) The fraction of bases in B that had non-zero coverage.

But, of course, you can calculate the RPKM in you experiment with knowledge about transcript length and sequencing depth.

Additionally: Please note, that the -s flag counts only reads which align to the respective strand of your desired region. If your experiment is not strand-specific, you should omit this flag as reads originating from one region can align to both strands.

Regarding your questions...

Q1: When calculating RPKM for given RefSeq gene including all the position reads, will the following command just calculate exon:exon junction reads and at same time ignore all other reads?

coverageBED -abam all.reads.bam -b refseq.genes.BED12.bed -s -split >coverage.bed

No. All reads are counted. Bedtools will treat split reads as two separate regions for counting.

E.g., Given your refseq.genes.BED12.bed contains transcript information with all exon regions, the command counts the coverage for the transcripts (all the contained exons). It will therefore incorporate all mapping reads and also the split reads (mapping to more than one exon). The -split flag causes the following:

Treat "split" BAM or BED12 entries as distinct BED intervals.
when computing coverage.
For BAM files, this uses the CIGAR "N" and "D" operations 
to infer the blocks for computing coverage

This means the spanning region of a split read which is not aligned to the read will not be counted as coverage. This is particular useful if you have exon-skipping information or want to know reads mapping to specific intron regions.

Q2: If one just want to calculate exon's RPKM, does it mean one should prepare -b file to record all the exon information, and run like this:

coverageBED -abam all.reads.bam -b all.exons.bed -s >coverage.bed

Yes. But you should also set the -split flag. This ensures that split-reads that span skipped exons are mistakenly counted. If you don't have a strand-specific protocol for your library preparation, you also omit the -s flag.

Q3: How to calculate RPKM for given genes whose reads overlap with exons? We all known BED12 format file record the exon information (start and length) for given RefSeq gene. Could BEDtools do the magic things? Note this calculation is different from Q1.

You have all the necessary information for calculating the RPKM on your own. The length of the transcript is the sum of the lengths of the incorporated exons. The coverage you can get from coverageBed like explained above. The RPKM can then be computed

Note, that this does not address alternative isoform usage. This is whole different problem but the exon-exon-reads may provide some insight on this.