Counting Reads On Paired-End Strand-Specific Rnaseq Data
2
0
Entering edit mode
11.5 years ago
Arun 2.4k

Hello everybody,

I've a strand-specific paired-end library on which I'd like to perform some standard DGE analysis. However I'm quite unclear about how to go about counting reads. Usually when I have a paired-end (unstranded) library, I first clip them for adapters and trim for quality. And when doing these preprocessing, if one of the pairs passes the filtering and the other doesn't, then I retain them as a single end read. Then, I use tophat to first map the PE reads (where both pairs are retained after filtering) and then pass the junctions obtained from that mapping step to run tophat again on the single end reads so as to not loose those otherwise "good" reads.

Now, the way I thought about this is to first do the same procedure to obtain PE reads that pass filtering and retain them. Then, those reads where only 1 of the pair passed filtering will be stored in two separate files as SE reads (depending on which read is retained) as they are strand-specific.

Even if this is good, again, after mapping, while counting the reads, I am unsure how to count the reads.

So my questions are: How do people normally go about mapping a PE library? Do they only keep "properly mapped pairs"? If so, do they count each pair as 1 read (as they are indeed coming from one fragment)? When you also have SE reads in the same bam file, how can you then count the number of reads? And if there are properly paired SS reads and SS SE reads, how does one go about counting the reads per gene?

I know it's a lot of questions, but the essence of it is how to go about counting reads in an unstranded and stranded library when you've got both PE and SE reads within the same bam file... Or is it not a good practice and one should throw away these otherwise good SE reads?

Thank you very much.

rna-seq gene-expression paired-end strand • 8.0k views
ADD COMMENT
0
Entering edit mode
11.5 years ago
fridhackery ▴ 170

If you wish to keep reads where the mate does not map, you should count these as one, and the paired reads as two.

The paired reads, coming from one fragment, will not map to the same exact location. Therefore, you are essentially counting how many times you observe each base in your reference, which is coverage. To compare between your runs, you will at least normalize by reads-per-million. The paired reads are like longer reads, so will increase the observed coverage of the gene across its length. The un-paired read can only increase the coverage on half that length.

ADD COMMENT
0
Entering edit mode

I think I would count each pair where both ends map as one count and then if only one read of the pair maps I would still count that as one count.

My logic is this: in DGE one of the reads often will map more poorly because it comes from the poly-A tail (assuming you are mapping to the genome). If one of the two maps it is still likely indicates the presence of a single molecule of RNA (or cDNA). If two fragments of the pair maps it still indicates the presence of a single RNA, not two.

But I haven't looked at this very closely.

You can try ESAT http://garberlab.umassmed.edu/software/esat/quantification.html

They have a more complex logic involving peak finding for DGE.

I am assuming by DGE you are referring to end sequencing?

ADD REPLY
0
Entering edit mode

@MicheleBusby, DGE is differential gene expression and we're talking about paired-end RNA-sequencing with Illumina. mRNA is fragmented before converting to (c)DNA and further processing. There's no particular bias I am aware of, therefore, reg. one end always mapping poorly because it's from the Poly-A tail.

ADD REPLY
0
Entering edit mode

Ah, sorry. We use DGE to refer to "digital gene expression" which is end sequencing. This terminology might be a Broad thing, or maybe just our lab.

Is your percentage of pairs that has only one read mapping very high? In that case, do you have a sense why one of the reads isn't mapping?

We only keep reads where both reads pass filtering because our pass filter rate is usually high enough that the loss of reads is trivial and the mates of the sketchy ones might be sketchy.

For full length RNA Seq we've been using the RSEM pipeline which aligns to the reads to the transcriptome with the poly-a sequences included at the end of transcripts and it does all the quantification for us. RSEM maps with Bowtie because it doesn't need to find splice junctions because it's the transcriptome which is also way faster. I am pretty sure it only uses alignments where both reads map but the counting is complex because it uses the transcriptome alignment to extrapolate which isoforms are expressed and has some correction for GC bias. This, of course, is overkill if you are doing yeast or something without much splicing. But you still need to account for your multiply mapped reads which can be a little tricky.

When I worked in yeast we just counted the uniquely mapped reads. We didn't have paired end then but I would have counted pairs as 1, and probably just kept the ones where both mapped.

ADD REPLY
0
Entering edit mode
11.5 years ago

I dont know about others but I keep both properly mapped pairs and also mapped read from a pair where only one read got mapped but this read should either have high mapping quality i.e. 20 or it should be uniquely aligner according to the aligner. Also, I agree with what fridhackery said above.

ADD COMMENT

Login before adding your answer.

Traffic: 1960 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6