Question

Intron-aware coverage for paired-end RNA-seq data.

0

Entering edit mode

9.2 years ago

Carlo Yague 8.9k

Hi guys,

For visualization purpose (like in a genome browser), I would like to get the per-base genome-wide coverage of paired-end RNA-seq data. I thought this would be easy, turns out it is not. Here is a picture showing what I want to do :

And here is where I came so far :

I start with mapped paired-end reads (accepted_hits.bam)
Then I split my file into two strand specific files according to this tutorial.
Then I use bedtools bamtobed with the -bedpe option to get a bed file that summarizes my paired reads. From there I can easily retrieve the start and end positions of the fragments and calculate coverage (see code below). Hurray. However I loose information on the introns...

And it is where I'm stuck. I thought of using bedtools bamtobed with the -split option along with -bedpe, but those two options seems not compatible. A possible solution would be to extract the halfs of the reads spanning introns that are not linked to the mate pair, save them as singleton and add them later to the coverage, but I have no idea how to do that. Any suggestion, even for a completely different method? Thank you!

samtools sort -no <(samtools view -b -f 2 fwd.bam) temp | bamToBed -i stdin -bedpe | cut -f 1,2,6 | sort -k 1,1 > fwd_merged_mates.bed
bedtools genomecov -bga -i fwd_merged_mates.bed -g chrom_summary  > WT1_fwd.bg

RNA-Seq bam • 4.2k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Carlo Yague 8.9k

0

Entering edit mode

Your goal is get genome coverage from only exons ( and stranded ) ?

ADD REPLY • link 9.2 years ago by GouthamAtla 12k

0

Entering edit mode

No, I want genome coverage for the whole genome. But if a read if split by an intron (mapping with tophat allows it), I need to take that into account when I compute the coverage, i.e, I can't just count +1 between the start and the end of a fragment if one of the read forming the fragment is split by an intron... Hope this clarify my question :)

ADD REPLY • link 9.2 years ago by Carlo Yague 8.9k

0

Entering edit mode

I guess you simply needs number of reads per gene/Exon. You could use HTSeq-count for counting the number of reads per gene, which takes care of split reads without double counting them. If you need exon level counts, DEXSeq provides some scripts that will flatten your GTF file and can be used with HTSeq-count to get exon level information.

ADD REPLY • link 9.2 years ago by GouthamAtla 12k

0

Entering edit mode

No, I thank you for your interest but this is not what I want. Instead I want to generate coverage tracks that cover the whole genome for visualization in a genome browser. We could call it "stranded per base count" if u want, in contrast with exon/gene count. I edited the question to make this point clear.

ADD REPLY • link 9.2 years ago by Carlo Yague 8.9k

0

Entering edit mode

For me is still unclear what you are trying to do and frankly this:

A possible solution would be to extract the halfs of the reads spanning introns that are not linked to the mate pair, save them as singleton and add them later to the coverage

Sounds like a bad idea. There is a rule that says: if it is hard to explain it is probably a bad idea.

IMHO If all you want is coverage then first compute the whole genome coverage ignoring genes for each strand then slice and dice that file with bedtools.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Istvan Albert 102k

0

Entering edit mode

Ok there seems to be some misunderstanding. When I say "intron", I'm not talking about actual intron, but about reads that are mapped discontinuously by tophat (although the two concepts are linked). And indeed, all I want is the whole coverage ignoring genes. But because of those discontinuity, I find it hard to compute it :

With single end reads and no intron, its easy, you just take the start and end of the reads and count +1 within the interval.
With single reads and introns, you split intron-containing reads into two reads and proceed as before.
With paired end and no intron, its also easy, you just take the start of the read 1, the end of the read 2 and count +1 within the interval.
But I'm confused on what to do when you have both paired end and introns to take into account.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Carlo Yague 8.9k

0

Entering edit mode

There isn't really a good solution for the case you mention, since you can't know with certainty that there's no splicing going on between your mates (yes, you can just assume that the GTF defines all legitimate splicing events, but that is an unjustifiable assumption in many situations).

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

0

Entering edit mode

Very nice point.

Currently I'm trying samtools depth. I need little more testing, but it looks like it does what I want.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Carlo Yague 8.9k

Ram · Answer 1 · 2015-09-24

1

Entering edit mode

9.2 years ago

Mike ▴ 60

It sounds like what you want to do is convert an aligned .sam file to a wiggle/bedgraph file, is this correct? The STAR RNA alignment program has an option to do this but I haven't used it.

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Mike ▴ 60

0

Entering edit mode

Exactly! I'll look into STAR RNA. Will update.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Carlo Yague 8.9k