Hi all,
I'm new to RNA-seq analysis, and I'm struggling to solve a problem.
I have 8 samples: 2 treatments with 4 reps each, so I'm just trying to do a simple differential gene expression comparison. I have strand-specific, 100bp paired end read libraries for all these, and I'm using R to generate a count matrix with the function summarizeOverlaps and using the following code:
for(i in 1:length(my.bam.files)){
for(j in 1:13){
print(j)
chr.param<-ScanBamParam(which=itag2.4.gr[j], flag=my.flag, what=c("mapq", "flag"))
bam.ga <- readGAlignments(my.bam.files[i], param=chr.param)
good.reads <- which(mcols(bam.ga)$flag %in% good.flags & mcols(bam.ga)$mapq >= mapq.cutoff)
bam.ga <- bam.ga[good.reads]
my.count <- summarizeOverlaps(ebg, bam.ga, mode="Union", ignore.strand=F)
count.mat[, i] <- count.mat[, i] + assays(my.count)$counts
rm(bam.ga, my.count)
gc()
}
Now in my understanding, a stringent analysis would just keep 1 read from each aligned pair. I tried to do this by keeping either reads with flags 83 and 99 (proper, mapped pairs, R1 reads) or reads with flags 147, 163 (proper mapped pairs, R2 reads). As I understood it, I should generate the exact same count matrix regardless of which pair I keep, as R1 and R2 reads should map to the same gene. However, this is not the case, as keeping R2 reads generates a matrix with a count matrix that looks fairly normal (about a third with no expression, and then a mix of low, medium, and high count numbers). When I keep R1 reads, the count numbers are very low and very few genes have any counts at all. Am I doing something wrong? Is my understanding of what each read corresponds to incorrect?
I've heard from other researchers that it's ok to just keep both reads and do the analysis on that, but I'm not confident now that that will give me a robust answer since I'm confused about whether or not the count matrix generated is actually accurate.
Any help or suggestions appreciated!