Hi
By this code I have obtained my raw read counts from RNA-seq
Alignment by STAR
STAR --genomeDir ./STAR_hg38_Genome --readFilesIn $fastq1 $fastq2
Sorting sam
samtools view -h -o aligned.sam sort.bam
Bam 2 Sam
samtools view -h -o aligned.sam ligned.sorted.bam
GeneCount
htseq-count --stranded=no -q aligned.sam ./gtf > counts.txt
But I have this results; Too many reads non uniquely mapped
__no_feature 427159
__ambiguous 1250237
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 10648052
I tried to summarized raw read counts by Featurecounts I saw rate of summarization was low
Load annotation file Homo_sapiens.GRCh38.dna.primary_assembly.gtf ... ||
|| Features : 1262162 ||
|| Meta-features : 58735 ||
|| Chromosomes/contigs : 47 ||
|| ||
|| Process SAM file aligned.sam... ||
|| Paired-end reads are included. ||
|| Assign alignments to features... ||
|| Total alignments : 47361817 ||
|| Successfully assigned alignments : 23092843 (48.8%) ||
|| Running time : 1.34 minutes
Should I worry about htseq results? You think problem comes from where? When I tried to sort sam by picard I saw less alignment_not_unique, Does it make any sense?
What is the read length?
Sorry how I should know the read length? Yes my ultimate goal is differential expression analysis
Just do
head your.fastq
and count how long the read in line 1 is. If the fastq is compressed dozcat your.fq.gz | head
Thank you
Ok, and the reverse read, so
2.fastq
?Ok, so normal 2x75bp. I was asking because overly short reads like 2x25 often lead to multimappers but in this case it probably comes down to what swbarnes2 says, having rRNA or similar contaminations. I would not worry too much about this. Proceed with the normal analysis, use something like PCA to check for potential batch effects (DESeq2 vignette has good manual on how to do that) and then see if results are reasonable after DEG.
For dif. gene expression analysis (I assume you do it because you count the reads), multimappers are usually not counted, though having high proportion of multimapped reads is not desired. What is the read length of your data? STAR has several options (e.g. --outFilterMismatchNmax) to restrict the number of mismatches to the reference, which can influence the mapping rates.
Additionally, the overlap resolution mode from htseq can influence your final read count stats, chech the docs to see which mode better fits your data.