Issues with strandedness, flags, and generating a count matrix using R. Please help!
0
1
Entering edit mode
9.0 years ago

Hi all,

I'm new to RNA-seq analysis, and I'm struggling to solve a problem.

I have 8 samples: 2 treatments with 4 reps each, so I'm just trying to do a simple differential gene expression comparison. I have strand-specific, 100bp paired end read libraries for all these, and I'm using R to generate a count matrix with the function summarizeOverlaps and using the following code:

for(i in 1:length(my.bam.files)){
  for(j in 1:13){
    print(j)
    chr.param<-ScanBamParam(which=itag2.4.gr[j], flag=my.flag, what=c("mapq", "flag"))
    bam.ga <- readGAlignments(my.bam.files[i], param=chr.param)
    good.reads <- which(mcols(bam.ga)$flag %in% good.flags & mcols(bam.ga)$mapq >= mapq.cutoff)
    bam.ga <- bam.ga[good.reads]
    my.count <- summarizeOverlaps(ebg, bam.ga, mode="Union", ignore.strand=F)
    count.mat[, i] <- count.mat[, i] + assays(my.count)$counts
    rm(bam.ga, my.count)
    gc()
  }

Now in my understanding, a stringent analysis would just keep 1 read from each aligned pair. I tried to do this by keeping either reads with flags 83 and 99 (proper, mapped pairs, R1 reads) or reads with flags 147, 163 (proper mapped pairs, R2 reads). As I understood it, I should generate the exact same count matrix regardless of which pair I keep, as R1 and R2 reads should map to the same gene. However, this is not the case, as keeping R2 reads generates a matrix with a count matrix that looks fairly normal (about a third with no expression, and then a mix of low, medium, and high count numbers). When I keep R1 reads, the count numbers are very low and very few genes have any counts at all. Am I doing something wrong? Is my understanding of what each read corresponds to incorrect?

I've heard from other researchers that it's ok to just keep both reads and do the analysis on that, but I'm not confident now that that will give me a robust answer since I'm confused about whether or not the count matrix generated is actually accurate.

Any help or suggestions appreciated!

RNA-Seq • 1.8k views
ADD COMMENT
0
Entering edit mode
the best thing you could do is to use software that was developed for this purpose. see htseq count for example. it will deal with strandedness and read pairs in the proper way.
ADD REPLY

Login before adding your answer.

Traffic: 1608 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6