Hi,
I am analyzing RNA-seq data produced with the Lexogen kit "Quant-Seq 3' mRNA-Seq FWD-UMI".
My samples are FFPE samples, and for each sample we sequenced 4 million single-end 75 nt reads.
I mapped the reads to the human reference genome hg38 using STAR and found that 80% of reads mapped uniquely to the genome and 30% were multi-mapped.
Subsequently, I deduplicated the mapped reads based on their UMI sequences. Then I checked the read distribution using RSeQC and calculated gene counts using htseq-count.
Looking at the RSeQC output, I see a high number of tags falling into introns. Why would so many reads land in introns? Could this be due to the type of sequencing?
Looking at the htseq-count output, many reads are counted as __no_feature. For example, in one sample with a total of 3,430,447 deduplicated reads, 1,911,232 counts are assigned to features and 1,492,974 counts are reported as __no_feature. Why is the number of __no_feature counts so high, and could it indicate a problem with the analysis or the sequencing?
In addition, is there a minimum number of reads that should be assigned to features in order to perform gene expression analysis? For example, would 877,152 counts assigned to features be enough?
Thank you!
Concetta
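To put the htseq-count numbers quoted in the question into proportion, here is a minimal Python sketch of the arithmetic. The three input values come from the post; the name `other` is just a label for whatever remains after the assigned and __no_feature counts are subtracted, and the guess that it corresponds to other htseq-count categories (such as __ambiguous) is an assumption, not something stated in the post.

```python
# Counts quoted in the question for one sample.
total_dedup = 3_430_447   # deduplicated reads
assigned = 1_911_232      # counts assigned to features
no_feature = 1_492_974    # counts reported as __no_feature

# Remainder not covered by the two categories above
# (assumption: other htseq-count categories such as __ambiguous).
other = total_dedup - assigned - no_feature

def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)

print(f"assigned:     {pct(assigned, total_dedup)}%")    # 55.7%
print(f"__no_feature: {pct(no_feature, total_dedup)}%")  # 43.5%
print(f"other:        {pct(other, total_dedup)}%")       # 0.8%
```

So in this sample a little over half of the deduplicated reads are assigned to features, and roughly 44% fall outside any annotated feature.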
Did you follow instructions Lexogen has for processing data produced using this kit?
Yes, I have followed their instructions.
Have you inspected the alignments to verify that the reads are indeed in introns? Are there specific pileups (those could be previously unknown genes/non-coding RNAs) or general scatter of alignments (low level DNA contamination)?
Yes, I checked the read alignments in IGV. The reads are spread over introns and intergenic regions. I checked those regions in the UCSC Genome Browser, and they contain annotated transposable elements.
I can exclude DNA contamination because a DNase treatment was performed during RNA extraction.
I have another question concerning the minimum number of reads mapped to genes. Is 1 million reads mapped to genes enough to perform gene expression analysis? For example, are 877,152 reads mapped to genes enough to perform expression analysis?
Hi, is this a stranded library? Have you checked that you used the correct strandedness setting? I am not sure, but I think you should have used -fr-secondstrand.
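One hedged way to test the strandedness question empirically is to run htseq-count once per `-s` setting (`yes`, `reverse`, `no`) on the same BAM file and compare the fraction of counts assigned to genes. The Python sketch below assumes the standard htseq-count output layout (tab-separated, feature ID first, count last, special counters prefixed with `__`). The function names and the toy numbers are hypothetical and only illustrate the comparison; they are not data from this thread.

```python
def assigned_fraction(htseq_lines):
    """Fraction of all counted reads that were assigned to a real feature.

    Lines whose ID starts with "__" (e.g. __no_feature, __ambiguous)
    are htseq-count's special counters and count as unassigned.
    """
    assigned = 0
    total = 0
    for line in htseq_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        count = int(fields[-1])
        total += count
        if not fields[0].startswith("__"):
            assigned += count
    return assigned / total if total else 0.0


def best_strand_setting(runs):
    """Given {setting: output lines}, return (best setting, all fractions)."""
    fractions = {s: assigned_fraction(lines) for s, lines in runs.items()}
    return max(fractions, key=fractions.get), fractions


# Toy, made-up outputs for two settings (illustration only).
toy_runs = {
    "yes": ["GENE_A\t80", "GENE_B\t10", "__no_feature\t10"],
    "reverse": ["GENE_A\t5", "GENE_B\t1", "__no_feature\t94"],
}
best, fractions = best_strand_setting(toy_runs)
print(best, fractions)
```

If one stranded setting assigns far more reads than the other, that is a strong hint about the library orientation. If both stranded settings still leave a large __no_feature fraction, strandedness alone probably does not explain the result.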
I am seeing similar issues in FFPE samples. Did you ever resolve this or is this expected in heavily fragmented RNA?