Background: I mapped my reads against the reference genome using the STAR command:
star
--runThreadN 10
--runMode alignReads
--readFilesCommand zcat
--outFilterMultimapNmax 1
--outFilterMatchNmin 35
--outSAMtype BAM Unsorted SortedByCoordinate
--quantMode GeneCounts
--twopassMode Basic
--outFileNamePrefix "$newfilename"
--genomeDir $indexedreference
--sjdbGTFfile $gtfreference
--readFilesIn $inputfile1 $inputfile2
Output: As explained on the STAR manual (here), I get a file called *ReadsPerGene.out.tab with the following results:
N_unmapped 20460635 20460635 20460635
N_multimapping 0 0 0
N_noFeature 23097482 39371367 23239579
N_ambiguous 497097 12313 293179
ENSDARG00000009262 139 0 139
ENSDARG00000018618 2372 5 2367
ENSDARG00000045874 206 4 303
ENSDARG00000057468 13 101 13
ENSDARG00000045873 273 2 271
ENSDARG00000079143 72 0 72
ENSDARG00000057714 206 31 175
Explanation from the manual:
Counting number of reads per gene. With --quantMode GeneCounts option STAR will count number reads per gene while mapping. A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the pairedend read are checked for overlaps. The counts coincide with those produced by htseq-count with default parameters. This option requires annotations (GTF or GFF with –sjdbGTFfile option) used at the genome generation step, or at the mapping step. STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:
column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
Select the output according to the strandedness of your data. Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads. With --quantMode TranscriptomeSAM GeneCounts, and get both the Aligned.toTranscriptome.out.bam and ReadsPerGene.out.tab outputs.
Questions:
- Which column should I use for differential expression analysis? Why?
- Should the sum of columns 3 and 4 equal column 2? Why or why not? What does it mean when they are not equal?
- Why for some transcripts the value in column 3 is higher than the value in column 4, while for most of the transcripts the value in column 4 is higher than the value in column 3?
- What does it mean when the value in column 3 or 4 is higher than the value in column 2?
Thank you!
Is this a homework assignment?
Thanks for the clear explanations. If I need FPKM data, do I use these read counts ?