I have got some RNAseq data from Mus Musculus using TruSeq Stranded Total RNA in paired-end. I want to count my reads over my genes regarding to the gene orientation. As if a gene is plus strand, reads on forward strand falling into my plus gene area should be counted and if a gene is minus strand, reads on reverse strand falling into my minus gene area should be counted.
Let's take two examples from Ensembl :
Hmgb2
:ENSMUSG00000054717.7
chr8:57,511,907-57,515,999
forward strand
http://www.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000076617;r=12:113418558-113422730Ighm
:ENSMUSG00000076617.9
chr12:113,418,558-113,422,730
reverse strand
http://www.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000054717;r=8:57511907-57515999
I created my index with a gencode reference genome and I did my alignments using STAR with the --quantMode
and an annotation file from gencode. I've got my _Aligned.sortedByCoord.out.bam
and _ReadsPerGene.out.tab
.
Check the strand of my 2 genes into my annotation file :
zgrep -i --color "Hmgb2" gencode.vM18.chr_patch_hapl_scaff_and_23_custom_igh_gff3sort.annotation.gtf.gz
#chr8 HAVANA gene 57511907 57515999 . + . gene_id "ENSMUSG00000054717.7"; gene_type "protein_coding"; gene_name "Hmgb2"; level 2; havana_gene "OTTMUSG00000060717.1";
zgrep -i --color "Ighm" gencode.vM18.chr_patch_hapl_scaff_and_23_custom_igh_gff3sort.annotation.gtf.gz
#chr12 HAVANA gene 113418558 113422730 . - . gene_id "ENSMUSG00000076617.9"; gene_type "IG_C_gene"; gene_name "Ighm"; level 2; havana_gene "OTTMUSG00000051485.2";
Ok, good I have Hmgb2
plus strand and Ighm
minus strand.
Check the read strand under IGV for these two genes :
Hmgb2
Ighm
Hmgb2
gets a lot of paired-read F2R1 (blue reads) and vice-versa Ighm
gets a lot of paired-read F1R2 (red reads)
Note : As paired-end Illumina sequencing is R1 forward and R2 reverse, I was expecting R1 to lead the forward strand, which is not my case but anyway, it is not the point, R2 is leading the forward strand.
Reading the documentation of STAR, part 7.Counting number of reads per gene.
STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:
- column 1: gene ID
- column 2: counts for unstranded RNA-seq
- column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
- column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
Select the output according to the strandedness of your data. Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads.
So, what I get is that, either the 3rd column give me my sense count or antisense, and the 4th will give me the opposite strand result.
Fine. But I want to know if STAR take the gene strand into account. I've found this threa thread (with $2 = column2, $3 = column3, $4 = column4)
$2 is for unstranded hits, but those overlapping on opposite strand of features are considered ambiguous. $3 reports hits based on the strand you have given in your gff annotation, and $4 in the reverse direction of your features in gff (for PE-data the 5'3'-direction is also considered). Refer to -s option of htseq-count
So, I was expecting to get high count of Hmgb2
in either column 3 or 4 and high count of Ighm
in the other column
Hmgb2
/ ENSMUSG00000054717.7
grep "ENSMUSG00000054717.7" file_ReadsPerGene.out.tab
#ENSMUSG00000054717.7 3400 31 3369
Ighm
/ ENSMUSG00000076617.9
grep "ENSMUSG00000076617.9" file_ReadsPerGene.out.tab
#ENSMUSG00000076617.9 16063 11 16052
All my high counts are in the 4th column... Did I forgot to tweak some options ?
It does not bother me to have my count in the antisense column but I was expecting my gene counts from 2 genes stranded in different way to be in different column.
Unless, a read is counted in the feature strandness ? Like, one read in reverse in a minus strand gene is counted as forward ?
Since not everyone does paired end, the first read is what is used for strand determination. Your fragment corresponds to the complement of your actual mRNA, so your first read will be anti-sense and your second read will be sense.
Making the call based on R1 is logical because for Illumina single-end sequencing you are just sequencing read #1 - so in all possible cases you have read #1.
I cannot see your pictures please see : How to add images to a Biostars post
Something is off about Biostars and images...I can see the images fine on my phone, but not at my work desktop.
What about now?
All good now 👍
Still no on my desktop...
Happend to me yesterday