Hello everybody,
I've a strand-specific paired-end library on which I'd like to perform some standard DGE analysis. However I'm quite unclear about how to go about counting reads. Usually when I have a paired-end (unstranded) library, I first clip them for adapters and trim for quality. And when doing these preprocessing, if one of the pairs passes the filtering and the other doesn't, then I retain them as a single end read. Then, I use tophat to first map the PE reads (where both pairs are retained after filtering) and then pass the junctions obtained from that mapping step to run tophat again on the single end reads so as to not loose those otherwise "good" reads.
Now, the way I thought about this is to first do the same procedure to obtain PE reads that pass filtering and retain them. Then, those reads where only 1 of the pair passed filtering will be stored in two separate files as SE reads (depending on which read is retained) as they are strand-specific.
Even if this is good, again, after mapping, while counting the reads, I am unsure how to count the reads.
So my questions are: How do people normally go about mapping a PE library? Do they only keep "properly mapped pairs"? If so, do they count each pair as 1 read (as they are indeed coming from one fragment)? When you also have SE reads in the same bam file, how can you then count the number of reads? And if there are properly paired SS reads and SS SE reads, how does one go about counting the reads per gene?
I know it's a lot of questions, but the essence of it is how to go about counting reads in an unstranded and stranded library when you've got both PE and SE reads within the same bam file... Or is it not a good practice and one should throw away these otherwise good SE reads?
Thank you very much.
I think I would count each pair where both ends map as one count and then if only one read of the pair maps I would still count that as one count.
My logic is this: in DGE one of the reads often will map more poorly because it comes from the poly-A tail (assuming you are mapping to the genome). If one of the two maps it is still likely indicates the presence of a single molecule of RNA (or cDNA). If two fragments of the pair maps it still indicates the presence of a single RNA, not two.
But I haven't looked at this very closely.
You can try ESAT http://garberlab.umassmed.edu/software/esat/quantification.html
They have a more complex logic involving peak finding for DGE.
I am assuming by DGE you are referring to end sequencing?
@MicheleBusby, DGE is differential gene expression and we're talking about paired-end RNA-sequencing with Illumina. mRNA is fragmented before converting to (c)DNA and further processing. There's no particular bias I am aware of, therefore, reg. one end always mapping poorly because it's from the Poly-A tail.
Ah, sorry. We use DGE to refer to "digital gene expression" which is end sequencing. This terminology might be a Broad thing, or maybe just our lab.
Is your percentage of pairs that has only one read mapping very high? In that case, do you have a sense why one of the reads isn't mapping?
We only keep reads where both reads pass filtering because our pass filter rate is usually high enough that the loss of reads is trivial and the mates of the sketchy ones might be sketchy.
For full length RNA Seq we've been using the RSEM pipeline which aligns to the reads to the transcriptome with the poly-a sequences included at the end of transcripts and it does all the quantification for us. RSEM maps with Bowtie because it doesn't need to find splice junctions because it's the transcriptome which is also way faster. I am pretty sure it only uses alignments where both reads map but the counting is complex because it uses the transcriptome alignment to extrapolate which isoforms are expressed and has some correction for GC bias. This, of course, is overkill if you are doing yeast or something without much splicing. But you still need to account for your multiply mapped reads which can be a little tricky.
When I worked in yeast we just counted the uniquely mapped reads. We didn't have paired end then but I would have counted pairs as 1, and probably just kept the ones where both mapped.