Intron-aware coverage for paired-end RNA-seq data.
2
0
Entering edit mode
9.2 years ago

Hi guys,

For visualization purpose (like in a genome browser), I would like to get the per-base genome-wide coverage of paired-end RNA-seq data. I thought this would be easy, turns out it is not. Here is a picture showing what I want to do :

image

And here is where I came so far :

  • I start with mapped paired-end reads (accepted_hits.bam)
  • Then I split my file into two strand specific files according to this tutorial.
  • Then I use bedtools bamtobed with the -bedpe option to get a bed file that summarizes my paired reads. From there I can easily retrieve the start and end positions of the fragments and calculate coverage (see code below). Hurray. However I loose information on the introns...

And it is where I'm stuck. I thought of using bedtools bamtobed with the -split option along with -bedpe, but those two options seems not compatible. A possible solution would be to extract the halfs of the reads spanning introns that are not linked to the mate pair, save them as singleton and add them later to the coverage, but I have no idea how to do that. Any suggestion, even for a completely different method? Thank you!

samtools sort -no <(samtools view -b -f 2 fwd.bam) temp | bamToBed -i stdin -bedpe | cut -f 1,2,6 | sort -k 1,1 > fwd_merged_mates.bed
bedtools genomecov -bga -i fwd_merged_mates.bed -g chrom_summary  > WT1_fwd.bg
RNA-Seq bam • 4.2k views
ADD COMMENT
0
Entering edit mode

Your goal is get genome coverage from only exons ( and stranded ) ?

ADD REPLY
0
Entering edit mode

No, I want genome coverage for the whole genome. But if a read if split by an intron (mapping with tophat allows it), I need to take that into account when I compute the coverage, i.e, I can't just count +1 between the start and the end of a fragment if one of the read forming the fragment is split by an intron... Hope this clarify my question :)

ADD REPLY
0
Entering edit mode

I guess you simply needs number of reads per gene/Exon. You could use HTSeq-count for counting the number of reads per gene, which takes care of split reads without double counting them. If you need exon level counts, DEXSeq provides some scripts that will flatten your GTF file and can be used with HTSeq-count to get exon level information.

ADD REPLY
0
Entering edit mode

No, I thank you for your interest but this is not what I want. Instead I want to generate coverage tracks that cover the whole genome for visualization in a genome browser. We could call it "stranded per base count" if u want, in contrast with exon/gene count. I edited the question to make this point clear.

ADD REPLY
0
Entering edit mode

For me is still unclear what you are trying to do and frankly this:

A possible solution would be to extract the halfs of the reads spanning introns that are not linked to the mate pair, save them as singleton and add them later to the coverage

Sounds like a bad idea. There is a rule that says: if it is hard to explain it is probably a bad idea.

IMHO If all you want is coverage then first compute the whole genome coverage ignoring genes for each strand then slice and dice that file with bedtools.

ADD REPLY
0
Entering edit mode

Ok there seems to be some misunderstanding. When I say "intron", I'm not talking about actual intron, but about reads that are mapped discontinuously by tophat (although the two concepts are linked). And indeed, all I want is the whole coverage ignoring genes. But because of those discontinuity, I find it hard to compute it :

  • With single end reads and no intron, its easy, you just take the start and end of the reads and count +1 within the interval.
  • With single reads and introns, you split intron-containing reads into two reads and proceed as before.
  • With paired end and no intron, its also easy, you just take the start of the read 1, the end of the read 2 and count +1 within the interval.
  • But I'm confused on what to do when you have both paired end and introns to take into account.
ADD REPLY
0
Entering edit mode

There isn't really a good solution for the case you mention, since you can't know with certainty that there's no splicing going on between your mates (yes, you can just assume that the GTF defines all legitimate splicing events, but that is an unjustifiable assumption in many situations).

ADD REPLY
0
Entering edit mode

Very nice point.

Currently I'm trying samtools depth. I need little more testing, but it looks like it does what I want.

ADD REPLY
1
Entering edit mode
9.2 years ago
Mike ▴ 60

It sounds like what you want to do is convert an aligned .sam file to a wiggle/bedgraph file, is this correct? The STAR RNA alignment program has an option to do this but I haven't used it.

ADD COMMENT
0
Entering edit mode

Exactly! I'll look into STAR RNA. Will update.

ADD REPLY

Login before adding your answer.

Traffic: 1956 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6