Entering edit mode
6.6 years ago
Vasu
▴
790
Hi,
With Rna-seq data I have used hisat2 for aligning reads to genome. To check the mapping percentage I used samtools flagstat sample.sorted.bam
When I looked at few samples I see big number of reads passed and mapped.
Samples Total_reads Mapped_reads
Sample1 1076.13 M reads 1035.27 M reads
Sample2 1273.74 M reads 1175.18 M reads
Sample3 768.89 M reads 760.35 M reads
Sample4 105.84 M reads 101.81 M reads
Sample5 506.19 M reads 496.32 M reads
1) Is that big number reads mapped for few samples common? Why I see very high mapping percentage? Libraries were generated with Ribosomal rna depletion kit.
2) For few samples I see almost 45k genes among 53k showing 0 read counts. Don't know why.
3) I would also like to check reads mapped to ribosomal genes or ribosomal RNA's. Map I know how to check that?
When I looked at counts_matrix using featureCounts, I see around 540 rRNA genes based on gene_type among 58k genes. All these rRNA genes have 0,1,10 counts for all the samples. As I see rRNA genes, does it mean some contamination happened while generating libraries?
Ribodepletion may be imperfect at times and may not remove entire rRNA present. Depending on aligner you used (and how you chose to handle the multi-mappers) you may have more or less counts for rRNA. You could of course omit considering those counts in downstream analysis but not having any counts associated with 45K genes (out of 53K) sounds a bit odd.
What genome are these samples from? How many reads did you have per sample to begin with?
It is human genome. 45k genes among 58K showed 0 counts only for couple of samples. For the rest of samples I see atleast 15-20k genes with 0 counts. Total reads per sample is given above in the question.
Ah I see. The reads are extremely unbalanced. A billion plus for two samples and only a tenth for another. Still not having any counts for 30-40% of genes seems a rather high number to me.
Have you examined the alignments to see if they look reasonable? If you have some idea about the experiment do basic visual QC checks out (e.g. a deletion has no counts associated with it, up-regulation where you expect it, no pileups outside of exons etc).
Yes I did QC check. Also mapping quality after alignment. Mean quality score, per seq quality score shows positive results. For per base seq content I saw a bias towards 3' end and so I trimmed 3 bases on the 3' end during alignment. For Per seq GC content all the samples show a failure results. There are spikes instead of normal distribution for all the samples. No, Adapter content.
What I should do now?
Without an understanding of what this experiment is about it may be hard for anyone here to offer constructive suggestions.
BTW; I was referring to post-alignment QC checks. You are mostly referring to the pre-alignment QC.
This experiment is mainly to get lncRNAs and to study them. For post alignment QC check what to do?
Use Qualimap on your BAM files.
Yes, I did rseqc (mapping statistics) for one of the sample.
~50% of reads are multi-mapping in this example. Are you counting
lncRNA's
or trying to discover new ones?counting lncRnas and pc genes. will do downstream analysis. Do u think it is fine from the above statistics? Also would like to discover new lncRNAS.
Hi, I'm using qualimap bamqc with this command
qualimap bamqc -bam *.sorted.bam -p strand-specific-reverse -gff annotation.gtf -outformat HTML --java-mem-size=4G
.For few samples it gave the output. But suddenly for one of the sample this is what I see
Selected tool: bamqc Available memory (Mb): 32 Max memory (Mb): 3817 Starting bam qc.... Loading sam header... Loading locator... Loading reference... Number of windows: 400, effective number of windows: 593 Chunk of reads size: 1000 Number of threads: 20 Initializing regions from annotation.gtf..... Found 2617914 regions Filling region references... Processed 50 out of 593 windows... Processed 100 out of 593 windows... Processed 150 out of 593 windows... Processed 200 out of 593 windows...
WARNING: out of memory! Qualimap allows to set RAM size using special argument: --java-mem-size Check more details using --help command or read the manual.
What I should do now? Could you please also answer to my previous comment.
Hi could you please answer to my above comments
2.What are those genes without read counts ? Sometimes very small ncRNA such as miRNAs and pirRNAs are in the annotation but not well captured in total RNA-seq. So for those classes of RNA, it should be ok to have 0 read counts. On the other hand, if you have 0 read count for most protein coding genes, then something went wrong (probably).
There are very less protein coding genes with 0 counts among all the samples. Most of the genes with 0 counts are rRNA, snRNA etc...